Abstract: In recent years, while many efforts have been made to prove that
the Lasso behaves like a variable selection procedure at the price of strong
assumptions on the geometric structure of these variables, much less attention
has been paid to the analysis of the performance of the Lasso as a
regularization algorithm. Our first purpose here is to provide a result in this
direction by proving that the Lasso works almost as well as the deterministic
Lasso provided that the regularization parameter is properly chosen. This
result does not require any assumption at all, neither on the structure of the
variables nor on the regression function.

Our second purpose is to introduce a new estimator particularly adapted
to deal with infinite countable dictionaries. This estimator is constructed as
an $\ell_0$-penalized estimator among a sequence of Lasso estimators associated
to a dyadic sequence of growing truncated dictionaries. The selection procedure
automatically chooses the best level of truncation of the dictionary so as to
make the best tradeoff between approximation, $\ell_1$-regularization and
sparsity. From a theoretical point of view, we shall provide an oracle
inequality satisfied by this selected Lasso estimator.

All the oracle inequalities presented in this paper are obtained via the
application of a single general theorem of model selection among a collection
of nonlinear models. The key idea that enables us to apply this general theorem
is to see $\ell_1$-regularization as a model selection procedure among
$\ell_1$-balls.

Finally, rates of convergence achieved by the Lasso and the selected Lasso 
estimators on a wide class of functions are derived from these oracle inequalities, 
showing that these estimators perform at least as well as greedy algorithms. 

Key-words: Lasso, $\ell_1$-oracle inequalities, model selection by
penalization, $\ell_1$-balls, generalized linear Gaussian model.
An $\ell_1$-oracle inequality for the Lasso

Summary: In recent years, while many efforts have been made to prove that the
Lasso acts as a variable selection procedure at the price of restrictive
assumptions on the geometric structure of these variables, few works have
analyzed the performance of the Lasso as an $\ell_1$-regularization algorithm.
Our first objective is to provide a result in this direction by proving that
the Lasso behaves almost as well as the deterministic Lasso provided that the
regularization parameter is well chosen. This result requires no assumption,
neither on the structure of the variables nor on the regression function.

Our second objective is to construct a new estimator particularly suited to
the use of infinite dictionaries. This estimator is built by
$\ell_0$-penalization of a sequence of Lasso estimators associated with an
increasing dyadic sequence of truncated dictionaries. The corresponding
algorithm automatically chooses the truncation level guaranteeing the best
tradeoff between approximation, $\ell_1$-regularization and sparsity. From a
theoretical point of view, we establish an oracle inequality satisfied by this
estimator.

All the oracle inequalities presented in this article are obtained by
applying a model selection theorem among a collection of nonlinear models,
thanks to the key idea of viewing $\ell_1$-regularization as a model selection
procedure among $\ell_1$-balls.

Finally, we deduce from these oracle inequalities rates of convergence over
wide classes of functions, showing in particular that the Lasso estimators
perform as well as greedy algorithms.

Key-words: Lasso, $\ell_1$-oracle inequalities, model selection by
penalization, $\ell_1$-balls, generalized linear Gaussian models.
1 Introduction

We consider the problem of estimating a regression function $f$ belonging to a
Hilbert space $\mathbb{H}$ in a fairly general Gaussian framework which includes
the fixed design regression and the white noise frameworks. Given a dictionary
$\mathcal{D} = \{\phi_j\}_j$ of functions in $\mathbb{H}$, we aim at constructing
an estimator $\hat f = \hat\theta.\phi := \sum_j \hat\theta_j \phi_j$ of $f$
which enjoys both good statistical properties and computational performance
even for large or infinite dictionaries.

For high-dimensional dictionaries, direct minimization of the empirical risk
can lead to overfitting and we need to add a complexity penalty to avoid it.
One could use an $\ell_0$-penalty, i.e. penalize the number of non-zero
coefficients $\hat\theta_j$ of $\hat f$ (see [4] for instance) so as to produce
interpretable sparse models, but there is no efficient algorithm to solve this
non-convex minimization problem when the size of the dictionary becomes too
large. On the contrary, $\ell_1$-penalization leads to convex optimization and
is thus computationally feasible even for high-dimensional data. Moreover, due
to its geometric properties, the $\ell_1$-penalty tends to produce some
coefficients that are exactly zero and hence often behaves like an
$\ell_0$-penalty. These are the main motivations for introducing
$\ell_1$-penalization rather than other penalizations.

In the linear regression framework, the idea of $\ell_1$-penalization was first
introduced by Tibshirani [18] who considered the so-called Lasso estimator
(Least Absolute Shrinkage and Selection Operator). Since then, many studies of
this estimator have been carried out, not only in the linear regression
framework but also in the nonparametric regression setup with quadratic or more
general loss functions (see [3], [14], [19] among others). In the particular
case of the fixed design Gaussian regression models, if we observe $n$ i.i.d.
random couples $(x_1, Y_1), \dots, (x_n, Y_n)$ such that

$$Y_i = f(x_i) + \sigma \xi_i, \quad i = 1, \dots, n, \tag{1.1}$$

and if we consider a dictionary $\mathcal{D}_p = \{\phi_1, \dots, \phi_p\}$ of
size $p$, the Lasso estimator is defined as the following $\ell_1$-penalized
least squares estimator

$$\hat f_p := \hat f_p(\lambda_p) = \arg\min_{h \in \mathcal{L}_1(\mathcal{D}_p)} \|Y - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}, \tag{1.2}$$

where $\|Y - h\|^2 := \sum_{i=1}^n (Y_i - h(x_i))^2 / n$ is the empirical risk
of $h$, $\mathcal{L}_1(\mathcal{D}_p)$ is the linear span of $\mathcal{D}_p$
equipped with the $\ell_1$-norm
$\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} := \inf\{\|\theta\|_1 = \sum_{j=1}^p |\theta_j| : h = \theta.\phi = \sum_{j=1}^p \theta_j \phi_j\}$
and $\lambda_p > 0$ is a regularization parameter.

Since $\ell_1$-penalization can be seen as a "convex relaxation" of
$\ell_0$-penalization, many efforts have been made to prove that the Lasso
behaves like a variable selection procedure by establishing sparsity oracle
inequalities showing that the $\ell_1$-solution mimics the "$\ell_0$-oracle"
(see for instance [3] for the prediction loss in the case of the quadratic
nonparametric Gaussian regression model). Nonetheless, all these results
require strong restrictive assumptions on the geometric structure of the
variables. We refer to [6] for a detailed overview of all these restrictive
assumptions.

In this paper, we shall explore another approach by analyzing the performance
of the Lasso as a regularization algorithm rather than a variable selection
procedure. This shall be done by providing an $\ell_1$-oracle type inequality
satisfied by this estimator (see Theorem 3.2). In the particular case of the
fixed design Gaussian regression model, this result says that if
$\mathcal{D}_p = \{\phi_1, \dots, \phi_p\}$ with
$\max_{j=1,\dots,p} \|\phi_j\| \le 1$, then there exists an absolute constant
$C > 0$ such that for all
$\lambda_p \ge 4\sigma n^{-1/2}(\sqrt{\ln p} + 1)$, the Lasso estimator defined
by (1.2) satisfies

$$\mathbb{E}\left[\|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)}\right] \le C \left[\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \frac{\sigma \lambda_p}{\sqrt{n}}\right]. \tag{1.3}$$
This simply means that, provided that the regularization parameter $\lambda_p$
is properly chosen, the Lasso estimator works almost as well as the
deterministic Lasso. Notice that, unlike the sparsity oracle inequalities, the
above result does not require any assumption, either on the target function $f$
or on the structure of the variables $\phi_j$ of the dictionary
$\mathcal{D}_p$, except simple normalization that we can always assume by
considering $\phi_j / \|\phi_j\|$ instead of $\phi_j$. This $\ell_1$-oracle type
inequality is not entirely new. Indeed, on the one hand, Barron et al. [0] have
provided a similar risk bound but in the case of a truncated Lasso estimator
under the assumption that the target function is bounded by a constant. On
the other hand, Rigollet and Tsybakov [16] propose a result with the same
flavour but with the subtle difference that it is expressed as a probability
bound which does not imply (1.3) (see a more detailed explanation in
Section 3.2).

We shall derive (1.3) from a fairly general model selection theorem for
nonlinear models, interpreting $\ell_1$-regularization as an $\ell_1$-balls
model selection criterion (see Section 7). This approach will allow us to go
one step further than the analysis of the Lasso estimator for finite
dictionaries. Indeed, we can deal with infinite dictionaries in various
situations.

In the second part of this paper, we shall thus focus on infinite countable
dictionaries. The idea is to order the variables of the infinite dictionary
$\mathcal{D}$ thanks to the a priori knowledge we can have of these variables,
then write the dictionary
$\mathcal{D} = \{\phi_j\}_{j \in \mathbb{N}^*} = \{\phi_1, \phi_2, \dots\}$
according to this order, and consider the dyadic sequence of truncated
dictionaries
$\mathcal{D}_1 \subset \dots \subset \mathcal{D}_p \subset \dots \subset \mathcal{D}$
where $\mathcal{D}_p = \{\phi_1, \dots, \phi_p\}$ for
$p \in \{2^J, J \in \mathbb{N}\}$. Given this sequence $(\mathcal{D}_p)_p$,
we introduce an associated sequence of Lasso estimators $(\hat f_p)_p$ with
regularization parameters $\lambda_p$ depending on $p$, and choose
$\hat f_{\hat p}$ as an $\ell_0$-penalized estimator among this sequence by
penalizing the size of the truncated dictionaries $\mathcal{D}_p$.
This selected Lasso estimator $\hat f_{\hat p}$ is thus based on an algorithm
choosing automatically the best level of truncation of the dictionary and is
constructed to make the best tradeoff between approximation,
$\ell_1$-regularization and sparsity. From a theoretical point of view, we
shall establish an oracle inequality satisfied by this selected Lasso
estimator. Of course, although introduced for infinite dictionaries, this
estimator remains well defined for finite dictionaries and it may be profitable
to exploit its good properties and to use it rather than the classical Lasso
for such dictionaries.

In a third part of this paper, we shall focus on the rates of convergence of
the sequence of the Lassos and the selected Lasso estimator introduced above.
We shall provide rates of convergence of these estimators for a wide range of
function classes described by means of interpolation spaces
$\mathcal{B}_{q,r}$ that are adapted to the truncation of the dictionary and
constitute an extension of the intersection between weak-$\mathcal{L}_q$ spaces
and Besov spaces $\mathcal{B}^r_{2,\infty}$ for non-orthonormal dictionaries.
Our results will prove that the Lasso estimators $\hat f_p$ for $p$ large
enough and the selected Lasso estimator $\hat f_{\hat p}$ perform as well as
the greedy algorithms described by Barron et al. in [1]. Besides, our
convergence results shall highlight the advantage of using the selected Lasso
estimator rather than Lassos. Indeed, we shall prove that the Lasso estimators
$\hat f_p$, like the greedy algorithms in [1], are efficient only for $p$ large
enough compared to the unknown parameters of smoothness of $f$, whereas
$\hat f_{\hat p}$ always achieves good rates of convergence whenever the target
function $f$ belongs to some interpolation space $\mathcal{B}_{q,r}$. In
particular, we shall check that these rates of convergence are optimal by
establishing a lower bound of the minimax risk over the intersection between
$\mathcal{L}_q$ spaces and Besov spaces $\mathcal{B}^r_{2,\infty}$ in the
orthonormal case.

We shall end this paper by providing some theoretical results on the
performance of the Lasso for particular infinite uncountable dictionaries such
as those used for neural networks. Although Lasso solutions cannot be computed
in practice for such dictionaries, our purpose is just to point out the fact
that the Lasso theoretically performs as well as the greedy algorithms in [1],
by establishing rates of convergence based on an $\ell_1$-oracle type
inequality similar to (1.3) satisfied by the Lasso for such dictionaries.

The article is organized as follows. The notations and the generalized linear
Gaussian framework in which we shall work throughout the paper are introduced
in Section 2. In Section 3, we consider the case of finite dictionaries and
analyze the performance of the Lasso as a regularization algorithm by providing
an $\ell_1$-oracle type inequality which highlights the fact that the Lasso
estimator works almost as well as the deterministic Lasso provided that the
regularization parameter is large enough. In Section 4, we study the case of
infinite countable dictionaries and establish a similar oracle inequality for
the selected Lasso estimator $\hat f_{\hat p}$. In Section 5, we derive from
these oracle inequalities rates of convergence of the Lassos and the selected
Lasso estimator for a variety of function classes. Some theoretical results on
the performance of the Lasso for the infinite uncountable dictionaries used to
study neural networks in the artificial intelligence field are mentioned in
Section 6. Finally, Section 7 is devoted to the explanation of the key idea
that enables us to derive all our oracle inequalities from a single general
model selection theorem and to the statement of this general theorem. The
proofs are postponed until Section 8.



2 Models and notations 

2.1 General framework and statistical problem 

Let us first describe the generalized linear Gaussian model we shall work with.
We consider a separable Hilbert space $\mathbb{H}$ equipped with a scalar
product $\langle \cdot, \cdot \rangle$ and its associated norm $\|\cdot\|$.

Definition 2.1. [Isonormal Gaussian process] A centered Gaussian process
$(W(h))_{h \in \mathbb{H}}$ is isonormal if its covariance is given by
$\mathbb{E}[W(g)W(h)] = \langle g, h \rangle$ for all $g, h \in \mathbb{H}$.

The statistical problem we consider is to approximate an unknown target
function $f$ in $\mathbb{H}$ when observing a process
$(Y(h))_{h \in \mathbb{H}}$ defined by

$$Y(h) = \langle f, h \rangle + \varepsilon W(h), \quad h \in \mathbb{H}, \tag{2.1}$$

where $\varepsilon > 0$ is a fixed parameter and $W$ is an isonormal process.
This framework is convenient to cover both finite-dimensional models and the
infinite-dimensional white noise model as described in the following examples.

Example 2.2. [Fixed design Gaussian regression model] Let $\mathcal{X}$ be a
measurable space. One observes $n$ i.i.d. random couples
$(x_1, Y_1), \dots, (x_n, Y_n)$ of $\mathcal{X} \times \mathbb{R}$ such that

$$Y_i = f(x_i) + \sigma \xi_i, \quad i = 1, \dots, n, \tag{2.2}$$

where the covariates $x_1, \dots, x_n$ are deterministic elements of
$\mathcal{X}$, the errors $\xi_i$ are i.i.d. $\mathcal{N}(0, 1)$, $\sigma > 0$
and $f : \mathcal{X} \to \mathbb{R}$ is the unknown regression function to be
estimated. If one considers $\mathbb{H} = \mathbb{R}^n$ equipped with the
scalar product $\langle u, v \rangle = \sum_{i=1}^n u_i v_i / n$, defines
$y = (Y_1, \dots, Y_n)^T$, $\xi = (\xi_1, \dots, \xi_n)^T$ and improperly
denotes $h = (h(x_1), \dots, h(x_n))$ for every
$h : \mathcal{X} \to \mathbb{R}$, then
$W(h) := \sqrt{n}\, \langle \xi, h \rangle$ defines an isonormal Gaussian
process on $\mathbb{H}$ and $Y(h) := \langle y, h \rangle$ satisfies (2.1)
with $\varepsilon := \sigma / \sqrt{n}$.

Let us notice that

$$\|h\| = \sqrt{\frac{1}{n} \sum_{i=1}^n h^2(x_i)} \tag{2.3}$$

corresponds to the $L_2$-norm with respect to the measure
$\nu_x := \sum_{i=1}^n \delta_{x_i} / n$ with $\delta_u$ the Dirac measure at
$u$. It depends on the sample size $n$ and on the training sample via
$x_1, \dots, x_n$ but we omit this dependence in notation (2.3).

Example 2.3. [The white noise framework] In this case, one observes
$\zeta(x)$ for $x \in [0, 1]$ given by the stochastic differential equation

$$d\zeta(x) = f(x)\, dx + \varepsilon\, dB(x) \quad \text{with } \zeta(0) = 0,$$

where $B$ is a standard Brownian motion, $f$ is a square-integrable function
and $\varepsilon > 0$. If we define $W(h) = \int_0^1 h(x)\, dB(x)$ for every
$h \in L_2([0, 1])$, then $W$ is an isonormal process on
$\mathbb{H} = L_2([0, 1])$, and $Y(h) = \int_0^1 h(x)\, d\zeta(x)$ obeys (2.1)
provided that $\mathbb{H}$ is equipped with its usual scalar product
$\langle f, h \rangle = \int_0^1 f(x) h(x)\, dx$. Typically, $f$ is a signal
and $d\zeta(x)$ represents the noisy signal received at time $x$. This
framework easily extends to a $d$-dimensional setting if one considers some
multivariate Brownian sheet $B$ on $[0, 1]^d$ and takes
$\mathbb{H} = L_2([0, 1]^d)$.

2.2 Penalized least squares estimators 

To solve the general statistical problem (2.1), one can consider a dictionary
$\mathcal{D}$, i.e. a given finite or infinite set of functions
$\phi_j \in \mathbb{H}$ that arise as candidate basis functions for estimating
the target function $f$, and construct an estimator
$\hat f = \hat\theta.\phi := \sum_{j,\, \phi_j \in \mathcal{D}} \hat\theta_j \phi_j$
in the linear span of $\mathcal{D}$. The whole point is to choose a "good"
linear combination in the following sense. It makes sense to aim at
constructing an estimator as the best approximating point of $f$ by minimizing
$\|f - h\|$ or, equivalently, $-2\langle f, h \rangle + \|h\|^2$. However $f$
is unknown, so one may instead minimize the empirical least squares criterion

$$\gamma(h) := -2Y(h) + \|h\|^2. \tag{2.4}$$

But since we are mainly interested in very large dictionaries, direct
minimization of the empirical least squares criterion can lead to overfitting.
To avoid it, one can rather consider a penalized risk minimization problem and
consider

$$\hat f \in \arg\min_h\, \gamma(h) + \mathrm{pen}(h), \tag{2.5}$$

where $\mathrm{pen}(h)$ is a positive penalty to be chosen. Finally, since the
resulting estimator $\hat f$ depends on the observations, its quality can be
measured by its quadratic risk $\mathbb{E}[\|f - \hat f\|^2]$.

The penalty $\mathrm{pen}(h)$ can be chosen according to the statistical
target. In recent years, the situation where the number of variables $\phi_j$
can be very large (as compared to $\varepsilon^{-2}$) has received the
attention of many authors due to the increasing number of applications for
which this can occur. Micro-array data analysis or signal reconstruction from a
dictionary of redundant wavelet functions are typical examples for which the
number of variables either provided by Nature or considered by the statistician
is large. Then, an interesting target is to select the set of the "most
significant" variables $\phi_j$ among the initial collection. In this case, a
convenient choice for the penalty is the $\ell_0$-penalty that penalizes the
number of non-zero coefficients $\hat\theta_j$ of $\hat f$, thus providing
sparse estimators and interpretable models. Nonetheless, except when the
functions $\phi_j$ are orthonormal, there is no efficient algorithm to solve
this minimization problem in practice when the dictionary becomes too large. On
the contrary, $\ell_1$-penalization, that is to say
$\mathrm{pen}(h) \propto \|h\|_{\mathcal{L}_1(\mathcal{D})} := \inf\{\|\theta\|_1 = \sum_{j,\, \phi_j \in \mathcal{D}} |\theta_j| \text{ such that } h = \theta.\phi\}$,
leads to convex optimization and is thus computationally feasible even for
high-dimensional data. Moreover, due to its geometric properties, the
$\ell_1$-penalty tends to produce some coefficients that are exactly zero and
thus often behaves like an $\ell_0$-penalty, hence the popularity of
$\ell_1$-penalization and its associated estimator the Lasso defined by

$$\hat f(\lambda) = \arg\min_{h \in \mathcal{L}_1(\mathcal{D})} \gamma(h) + \lambda \|h\|_{\mathcal{L}_1(\mathcal{D})}, \quad \lambda > 0,$$

where $\mathcal{L}_1(\mathcal{D})$ denotes the set of functions $h$ in the
linear span of $\mathcal{D}$ with finite $\ell_1$-norm
$\|h\|_{\mathcal{L}_1(\mathcal{D})}$.

3 The Lasso for finite dictionaries 

While many efforts have been made to prove that the Lasso behaves like a
variable selection procedure at the price of strong (though unavoidable)
assumptions on the geometric structure of the dictionary (see [3] or [6] for
instance), much less attention has been paid to the analysis of the performance
of the Lasso as a regularization algorithm. The analysis we propose below goes
in this very direction. In this section, we shall consider a finite dictionary
$\mathcal{D}_p$ of size $p$ and provide an $\ell_1$-oracle type inequality
bounding the quadratic risk of the Lasso estimator by the infimum over
$\mathcal{L}_1(\mathcal{D}_p)$ of the tradeoff between the approximation term
$\|f - h\|^2$ and the $\ell_1$-norm $\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}$.



3.1 Definition of the Lasso estimator 

We consider the generalized linear Gaussian model and the statistical
problem (2.1) introduced in the last section. Throughout this section, we
assume that $\mathcal{D}_p = \{\phi_1, \dots, \phi_p\}$ is a finite dictionary
of size $p$. In this case, any $h$ in the linear span of $\mathcal{D}_p$ has
finite $\ell_1$-norm

$$\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} := \inf\left\{\|\theta\|_1 = \sum_{j=1}^p |\theta_j|,\ \theta \in \mathbb{R}^p \text{ such that } h = \theta.\phi\right\} \tag{3.1}$$

and thus belongs to $\mathcal{L}_1(\mathcal{D}_p)$. We propose to estimate $f$
by a penalized least squares estimator as introduced at (2.5) with a penalty
$\mathrm{pen}(h)$ proportional to $\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}$. This
estimator is the so-called Lasso estimator $\hat f_p$ defined by

$$\hat f_p := \hat f_p(\lambda_p) = \arg\min_{h \in \mathcal{L}_1(\mathcal{D}_p)} \gamma(h) + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}, \tag{3.2}$$

where $\lambda_p > 0$ is a regularization parameter and $\gamma(h)$ is defined
by (2.4).

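Remark 3.1. Let us notice that the general definition (3.2) coincides with the
usual definition of the Lasso in the particular case of the classical fixed
design Gaussian regression model presented in Example 2.2,

$$Y_i = f(x_i) + \sigma \xi_i, \quad i = 1, \dots, n.$$

Indeed, if we define $y = (Y_1, \dots, Y_n)^T$, we have

$$\gamma(h) = -2Y(h) + \|h\|^2 = -2\langle y, h \rangle + \|h\|^2 = \|y - h\|^2 - \|y\|^2,$$

so we deduce from (3.2) that the Lasso satisfies

$$\hat f_p = \arg\min_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|y - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right). \tag{3.3}$$

Let us now consider, for all $h \in \mathcal{L}_1(\mathcal{D}_p)$,
$\Theta_h := \{\theta = (\theta_1, \dots, \theta_p) \in \mathbb{R}^p,\ h = \theta.\phi = \sum_{j=1}^p \theta_j \phi_j\}$.
Then, we get from (3.1) that

$$\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|y - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) = \inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|y - h\|^2 + \lambda_p \inf_{\theta \in \Theta_h} \|\theta\|_1\right) = \inf_{\theta \in \mathbb{R}^p} \left(\|y - \theta.\phi\|^2 + \lambda_p \|\theta\|_1\right).$$

Therefore, we get from (3.3) that $\hat f_p = \hat\theta_p.\phi$ where
$\hat\theta_p = \arg\min_{\theta \in \mathbb{R}^p} \|y - \theta.\phi\|^2 + \lambda_p \|\theta\|_1$,
which corresponds to the usual definition of the Lasso estimator for the
fixed design Gaussian regression models with finite dictionaries of size $p$
(see [3] for instance).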
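
The coefficient form of the Lasso obtained in Remark 3.1 can be illustrated by
a short numerical sketch. The code below is not taken from the paper: it is a
minimal cyclic coordinate descent, one of several standard ways to minimize
$\|y - \theta.\phi\|^2 + \lambda_p \|\theta\|_1$ with the empirical norm of
Example 2.2 (which includes the factor $1/n$). The function names
`soft_threshold` and `lasso_cd` and the parameter `n_sweeps` are illustrative
assumptions.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal map of t * |.|)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(Phi, y, lam, n_sweeps=200):
    """Sketch: minimize (1/n) * sum_i (y_i - sum_j Phi[i][j] * theta_j)**2
    + lam * sum_j |theta_j| by cyclic coordinate descent.

    Phi is a list of n rows, each of length p (Phi[i][j] = phi_j(x_i)).
    The fixed number of sweeps is an assumption; no stopping rule is used.
    """
    n, p = len(Phi), len(Phi[0])
    theta = [0.0] * p
    resid = list(y)  # residual y - Phi theta, with theta = 0 initially
    # squared empirical norms ||phi_j||^2 = (1/n) * sum_i phi_j(x_i)^2
    col_norm2 = [sum(Phi[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_norm2[j] == 0.0:
                continue
            # empirical inner product of phi_j with the partial residual
            rho = sum(Phi[i][j] * resid[i] for i in range(n)) / n
            rho += col_norm2[j] * theta[j]
            # exact one-dimensional minimizer; threshold is lam / 2 because
            # the quadratic term is not multiplied by 1/2
            new = soft_threshold(rho, lam / 2.0) / col_norm2[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Phi[i][j] * delta
                theta[j] = new
    return theta


if __name__ == "__main__":
    Phi = [[1.0, 0.0], [0.0, 1.0]]
    print(lasso_cd(Phi, [2.0, 0.2], lam=0.5))
```

The threshold $\lambda_p / 2$ (rather than $\lambda_p$) comes from the
convention of (3.3), where the squared norm carries no factor $1/2$.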

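3.2 The $\ell_1$-oracle inequality

Let us now state the main result of this section.

Theorem 3.2. Assume that $\max_{j=1,\dots,p} \|\phi_j\| \le 1$ and that

$$\lambda_p \ge 4\varepsilon \left(\sqrt{\ln p} + 1\right). \tag{3.4}$$

Consider the corresponding Lasso estimator $\hat f_p$ defined by (3.2).
Then, there exists an absolute positive constant $C$ such that, for all
$z > 0$, with probability larger than $1 - 3.4\, e^{-z}$,

$$\|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le C \left[\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \lambda_p \varepsilon (1 + z)\right]. \tag{3.5}$$

Integrating (3.5) with respect to $z$ leads to the following $\ell_1$-oracle
type inequality in expectation,

$$\mathbb{E}\left[\|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)}\right] \le C \left[\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left(\|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \lambda_p \varepsilon\right]. \tag{3.6}$$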
This $\ell_1$-oracle type inequality highlights the fact that the Lasso (i.e.
the "noisy" Lasso) behaves almost as well as the deterministic Lasso provided
that the regularization parameter $\lambda_p$ is properly chosen. The proof of
Theorem 3.2 is detailed in Section 8 and we refer the reader to Section 7 for
the description of the key observation that has enabled us to establish it. In
a nutshell, the basic idea is to view the Lasso as the solution of a penalized
least squares model selection procedure over a countable collection of models
consisting of $\ell_1$-balls. Inequalities (3.5) and (3.6) are thus deduced
from a general model selection theorem borrowed from [5] and presented in
Section 7 as Theorem 7.1.
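
Since the lower bound (3.4) is fully explicit, it can be evaluated directly.
The following sketch computes the smallest regularization parameter allowed by
(3.4), together with the noise level $\varepsilon = \sigma / \sqrt{n}$ of the
fixed design setting of Example 2.2. The function names `noise_level` and
`lasso_lambda` are hypothetical helpers introduced for illustration only.

```python
import math


def noise_level(sigma, n):
    """epsilon = sigma / sqrt(n), as in Example 2.2 (fixed design)."""
    if n < 1:
        raise ValueError("the sample size n must be at least 1")
    return sigma / math.sqrt(n)


def lasso_lambda(eps, p):
    """Smallest lambda_p satisfying (3.4): 4 * eps * (sqrt(ln p) + 1)."""
    if p < 1:
        raise ValueError("the dictionary size p must be at least 1")
    return 4.0 * eps * (math.sqrt(math.log(p)) + 1.0)


if __name__ == "__main__":
    eps = noise_level(sigma=1.0, n=100)
    for p in (1, 10, 1000, 10 ** 6):
        print(p, lasso_lambda(eps, p))
```

The bound grows only like $\sqrt{\ln p}$, so even very large dictionaries lead
to a moderate increase of the regularization parameter.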

Remark 3.3. 

1. Notice that unlike the sparsity oracle inequalities with $\ell_0$-penalty
   established by many authors ([3], [19], [14] among others), the above
   result does not require any assumption, either on the target function $f$
   or on the structure of the variables $\phi_j$ of the dictionary
   $\mathcal{D}_p$, except simple normalization that we can always assume by
   considering $\phi_j / \|\phi_j\|$ instead of $\phi_j$.

2. Although such $\ell_1$-oracle type inequalities have already been studied
   by a few authors, no such general risk bound has yet been put forward.
   Indeed, Barron et al. [')] have provided a risk bound like (3.6) but they
   restrict to the case of a truncated Lasso estimator under the assumption
   that the target function is bounded by a constant. For their part, Rigollet
   and Tsybakov [16] propose an oracle inequality for the Lasso similar to
   (3.5) which is valid under the same assumption as the one of Theorem 3.2,
   i.e. simple normalization of the variables of the dictionary, but their
   bound in probability cannot be integrated to get a bound in expectation
   such as the one we propose at (3.6). Indeed, first notice that the constant
   measuring the level of confidence of their risk bound appears inside the
   infimum term as a multiplicative factor of the $\ell_1$-norm, whereas the
   constant $z$ measuring the level of confidence of our risk bound (3.5)
   appears as an additive constant outside the infimum term, so that the bound
   in probability (3.5) can easily be integrated with respect to $z$, which
   leads to the bound in expectation (3.6). Besides, the main drawback of the
   result given by Rigollet and Tsybakov is that the lower bound of the
   regularization parameter $\lambda_p$ they propose (i.e.
   $\lambda_p \ge \sqrt{8(1 + z/\ln p)}\, \varepsilon \sqrt{\ln p}$) depends on
   the level of confidence $z$, with the consequence that their choice of the
   Lasso estimator $\hat f_p = \hat f_p(\lambda_p)$ also depends on this level
   of confidence. On the contrary, our lower bound
   $\lambda_p \ge 4\varepsilon(\sqrt{\ln p} + 1)$ does not depend on $z$, so
   that we are able to get the result (3.5) satisfied with high probability by
   an estimator $\hat f_p = \hat f_p(\lambda_p)$ independent of the level of
   confidence of this probability.

3. Theorem 3.2 is interesting from the point of view of approximation theory.
   Indeed, as we shall see in Proposition 5.6, it shows that the Lasso
   performs as well as the greedy algorithms studied in [1] and [')].

4. We can check that the upper bound (3.6) is sharp. Indeed, assume that
   $p \ge 2$, that $f \in \mathcal{L}_1(\mathcal{D}_p)$ with
   $\|f\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R$ and that $R \ge \varepsilon$.
   Consider the Lasso estimator $\hat f_p$ for
   $\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1)$. Then, by bounding the infimum
   term in the right-hand side of (3.6) by the value at $h = f$, we get that

   $$\mathbb{E}\left[\|f - \hat f_p\|^2\right] \le C \lambda_p \left(\|f\|_{\mathcal{L}_1(\mathcal{D}_p)} + \varepsilon\right) \le 8 C R \varepsilon \left(\sqrt{\ln p} + 1\right), \tag{3.7}$$

   where $C > 0$. Now, it is established in Proposition 5 in [2] that there
   exists $\kappa > 0$ such that the minimax risk over the $\ell_1$-balls
   $S_{R,p} = \{h \in \mathcal{L}_1(\mathcal{D}_p),\ \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R\}$
   satisfies

   $$\inf_{\tilde h} \sup_{h \in S_{R,p}} \mathbb{E}\left[\|h - \tilde h\|^2\right] \ge \kappa \inf\left(R\varepsilon \sqrt{1 + \ln(p \varepsilon R^{-1})},\ p\varepsilon^2,\ R^2\right), \tag{3.8}$$

   where the infimum is taken over all possible estimators $\tilde h$.
   Comparing the upper bound (3.7) to the lower bound (3.8), we see that the
   ratio between them is bounded independently of $\varepsilon$ for all
   $S_{R,p}$ such that the signal to noise ratio $R\varepsilon^{-1}$ is between
   $\sqrt{\ln p}$ and $p$. This proves that the Lasso estimator $\hat f_p$ is
   approximately minimax over such sets $S_{R,p}$.
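
The chain of inequalities behind (3.7) can be spelled out as follows. This is
only a worked restatement of the argument of item 4 of Remark 3.3, using the
constant $C$ of (3.6) and the assumptions
$\|f\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R$, $\varepsilon \le R$ and
$\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1)$ stated there.

```latex
\begin{align*}
\mathbb{E}\left[\|f - \hat f_p\|^2\right]
  &\le \mathbb{E}\left[\|f - \hat f_p\|^2
       + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)}\right] \\
  &\le C\left[\|f - f\|^2 + \lambda_p \|f\|_{\mathcal{L}_1(\mathcal{D}_p)}
       + \lambda_p \varepsilon\right]
       && \text{(take } h = f \text{ in (3.6))} \\
  &= C \lambda_p \left(\|f\|_{\mathcal{L}_1(\mathcal{D}_p)} + \varepsilon\right) \\
  &\le C \lambda_p (R + \varepsilon)
   \le 2 C R \lambda_p
       && \text{(since } \varepsilon \le R \text{)} \\
  &= 8 C R \varepsilon \left(\sqrt{\ln p} + 1\right).
\end{align*}
```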

4 A selected Lasso estimator for infinite countable dictionaries

In many applications such as micro-array data analysis or signal
reconstruction, we are now faced with situations in which the number of
variables of the dictionary is always increasing and can even be infinite.
Consequently, it is desirable to find competitive estimators for such infinite
dimensional problems. Unfortunately, the Lasso is not well adapted to infinite
dictionaries. Indeed, from a practical point of view, there is no algorithm to
approximate the Lasso solution over an infinite dictionary because it is not
possible to evaluate the infimum of
$\gamma(h) + \lambda \|h\|_{\mathcal{L}_1(\mathcal{D})}$ over the whole set
$\mathcal{L}_1(\mathcal{D})$ for an infinite dictionary $\mathcal{D}$, but only
over a finite subset of it. Moreover, from a theoretical point of view, it is
difficult to prove good results on the Lasso for infinite dictionaries, except
in rare situations when the variables have a specific structure (see Section 6
on neural networks).

In order to deal with an infinite countable dictionary $\mathcal{D}$, one may
order the variables of the dictionary, write the dictionary
$\mathcal{D} = \{\phi_j\}_{j \in \mathbb{N}^*} = \{\phi_1, \phi_2, \dots\}$
according to this order, then truncate $\mathcal{D}$ at a given level $p$ to
get a finite subdictionary $\{\phi_1, \dots, \phi_p\}$ and finally estimate the
target function by the Lasso estimator $\hat f_p$ over this subdictionary. This
procedure raises two difficulties. First, one has to put an order on the
variables of the dictionary, and then one has to decide at which level the
dictionary should be truncated to make the best tradeoff between approximation
and complexity. Here, our purpose is to resolve this last dilemma by proposing
a selected Lasso estimator based on an algorithm choosing automatically the
best level of truncation of the dictionary once the variables have been
ordered. Of course, the algorithm and thus the estimation of the target
function will depend on the order in which the variables have been classified
beforehand. Notice that the classification of the variables can turn out to be
more or less difficult according to the problem under consideration.
Nonetheless, there are a few applications where there may be an obvious order
for the variables, for instance in the case of dictionaries of wavelets.
In this section, we shall first introduce the selected Lasso estimator that we
propose to approximate the target function in the case of infinite countable
dictionaries. Then, we shall provide an oracle inequality satisfied by this
estimator. This inequality is to be compared to Theorem 3.2 established for the
Lasso in the case of finite dictionaries. Its proof is again an application of
the general model selection Theorem 7.1. Finally, we make a few comments on the
possible advantage of using this selected Lasso estimator for finite
dictionaries in place of the classical Lasso estimator.

4.1 Definition of the selected Lasso estimator 

We still consider the generalized linear Gaussian model and the statistical
problem (2.1) introduced in Section 2. We recall that, to solve this problem,
we use a dictionary $\mathcal{D} = \{\phi_j\}_j$ and seek an estimator
$\hat f = \hat\theta.\phi = \sum_{j,\, \phi_j \in \mathcal{D}} \hat\theta_j \phi_j$
solution of the penalized risk minimization problem,

$$\hat f \in \arg\min_{h \in \mathcal{L}_1(\mathcal{D})} \gamma(h) + \mathrm{pen}(h), \tag{4.1}$$

where $\mathrm{pen}(h)$ is a suitable positive penalty. Here, we assume that
the dictionary is infinite countable and that it is ordered,

$$\mathcal{D} = \{\phi_j\}_{j \in \mathbb{N}^*} = \{\phi_1, \phi_2, \dots\}.$$

Given this order, we can consider the sequence of truncated dictionaries
$(\mathcal{D}_p)_{p \in \mathbb{N}^*}$, where

$$\mathcal{D}_p := \{\phi_1, \dots, \phi_p\} \tag{4.2}$$

corresponds to the subdictionary of $\mathcal{D}$ truncated at level $p$, and
the associated sequence of Lasso estimators
$(\hat f_p)_{p \in \mathbb{N}^*}$ defined in Section 3.1,

$$\hat f_p := \hat f_p(\lambda_p) = \arg\min_{h \in \mathcal{L}_1(\mathcal{D}_p)} \gamma(h) + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}, \tag{4.3}$$

where $(\lambda_p)_{p \in \mathbb{N}^*}$ is a sequence of regularization
parameters whose values will be specified below. Now, we shall choose a final
estimator as an $\ell_0$-penalized estimator among a subsequence of the Lasso
estimators $(\hat f_p)_{p \in \mathbb{N}^*}$. Let us denote by $\Lambda$ the
set of dyadic integers,

$$\Lambda = \{2^J, J \in \mathbb{N}\}, \tag{4.4}$$


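and define

$$\hat f_{\hat p} := \arg\min_{p \in \Lambda} \left[\gamma(\hat f_p) + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} + \mathrm{pen}(p)\right] \tag{4.5}$$

$$\phantom{\hat f_{\hat p} :}= \arg\min_{p \in \Lambda} \left\{\arg\min_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left[\gamma(h) + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right] + \mathrm{pen}(p)\right\}, \tag{4.6}$$

where $\mathrm{pen}(p)$ penalizes the size $p$ of the truncated dictionary
$\mathcal{D}_p$ for all $p \in \Lambda$. From (4.6) and the fact that
$\mathcal{L}_1(\mathcal{D}) = \bigcup_{p \in \Lambda} \mathcal{L}_1(\mathcal{D}_p)$,
we see that this selected Lasso estimator $\hat f_{\hat p}$ is a penalized
least squares estimator solution of (4.1) where, for any $p \in \Lambda$ and
$h \in \mathcal{L}_1(\mathcal{D}_p)$,
$\mathrm{pen}(h) = \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} + \mathrm{pen}(p)$
is a combination of both $\ell_1$-regularization and $\ell_0$-penalization. We
see from (4.5) that the algorithm automatically chooses the rank $\hat p$ so
that $\hat f_{\hat p}$ makes the best tradeoff between approximation,
$\ell_1$-regularization and sparsity.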
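
The selection rule (4.5) can be sketched numerically in the fixed design
setting of Example 2.2. The code below is an illustration under several
assumptions, not the authors' implementation: the inner Lasso is solved by a
plain cyclic coordinate descent, $\gamma(h)$ is replaced by
$\|y - h\|^2$ (they differ by the constant $\|y\|^2$, which does not change the
argmin), the $\ell_1$-norm of $\hat f_p$ is taken to be the $\ell_1$-norm of
the computed coefficient vector, and the choices
$\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1)$ and
$\mathrm{pen}(p) = 5\varepsilon^2 \ln p$ are those of (4.7). The names
`lasso_cd` and `selected_lasso` are hypothetical.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(Phi, y, lam, n_sweeps=200):
    """Coordinate descent for (1/n)*||y - Phi theta||^2 + lam*||theta||_1."""
    n, p = len(Phi), len(Phi[0])
    theta, resid = [0.0] * p, list(y)
    norm2 = [sum(Phi[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if norm2[j] == 0.0:
                continue
            rho = sum(Phi[i][j] * resid[i] for i in range(n)) / n
            rho += norm2[j] * theta[j]
            new = soft_threshold(rho, lam / 2.0) / norm2[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Phi[i][j] * delta
                theta[j] = new
    return theta


def selected_lasso(Phi, y, eps):
    """Sketch of (4.5): scan dyadic truncation levels p = 1, 2, 4, ... and
    keep the level minimizing empirical risk + l1 term + pen(p).

    Columns of Phi are assumed to be already ordered and normalized.
    Returns (p_hat, theta_hat) with theta_hat of length p_hat.
    """
    n, p_max = len(Phi), len(Phi[0])
    best = None
    p = 1
    while p <= p_max:
        lam = 4.0 * eps * (math.sqrt(math.log(p)) + 1.0)  # lambda_p of (4.7)
        pen = 5.0 * eps ** 2 * math.log(p)                # pen(p) of (4.7)
        sub = [row[:p] for row in Phi]                    # truncated dictionary
        theta = lasso_cd(sub, y, lam)
        fit = [sum(r[j] * theta[j] for j in range(p)) for r in sub]
        risk = sum((yi - fi) ** 2 for yi, fi in zip(y, fit)) / n
        crit = risk + lam * sum(abs(t) for t in theta) + pen
        if best is None or crit < best[0]:
            best = (crit, p, theta)
        p *= 2
    return best[1], best[2]


if __name__ == "__main__":
    Phi = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(selected_lasso(Phi, [0.0, 4.0, 0.0, 0.0], eps=0.1))
```

Only about $\log_2 p_{\max}$ Lasso problems are solved, which is the
computational motivation for the dyadic grid.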
Remark 4.1. Notice that from a theoretical point of view, one could have
defined $\hat f_{\hat p}$ as an $\ell_0$-penalized estimator among the whole
sequence of Lasso estimators $(\hat f_p)_{p \in \mathbb{N}^*}$ (or more
generally among any subsequence of $(\hat f_p)_{p \in \mathbb{N}^*}$) instead
of $(\hat f_p)_{p \in \Lambda}$. Nonetheless, to compute $\hat f_{\hat p}$
efficiently, it is interesting to limit the number of computations of the
sequence of Lasso estimators $\hat f_p$, especially if we choose an
$\ell_0$-penalty $\mathrm{pen}(p)$ that does not grow too fast with $p$,
typically $\mathrm{pen}(p) \propto \ln p$, which will be the case in the next
theorem. That is why we have chosen to consider a dyadic truncation of the
dictionary $\mathcal{D}$.

4.2 An oracle inequality for the selected Lasso estimator 

By applying the same general model selection theorem (Theorem 7.1) as for the establishment of Theorem 3.2, we can provide a risk bound satisfied by the estimator $\hat f_{\hat p}$ with properly chosen penalties $\lambda_p$ and $\mathrm{pen}(p)$ for all $p\in\Lambda$. The sequence of $\ell_1$-regularization parameters $(\lambda_p)_{p\in\Lambda}$ is simply chosen from the lower bound given by (3.4) while a convenient choice for the $\ell_0$-penalty will be $\mathrm{pen}(p) \propto \ln p$.



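Theorem 4.2. Assume that $\sup_{j\in\mathbb{N}^*}\|\phi_j\| \le 1$. Set for all $p\in\Lambda$,

$$\lambda_p = 4\varepsilon\left(\sqrt{\ln p} + 1\right), \qquad \mathrm{pen}(p) = 5\varepsilon^2\ln p, \tag{4.7}$$

and consider the corresponding selected Lasso estimator $\hat f_{\hat p}$ defined by (4.6). Then, there exists an absolute constant $C > 0$ such that

$$\mathbb{E}\Big[\|f-\hat f_{\hat p}\|^2 + \lambda_{\hat p}\|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \mathrm{pen}(\hat p)\Big] \le C\left[\inf_{p\in\Lambda}\left(\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \mathrm{pen}(p)\right) + \varepsilon^2\right]. \tag{4.8}$$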
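To make the selection rule concrete, here is a minimal Python sketch of the dyadic selection under a simplifying assumption that is not part of Theorem 4.2: the dictionary is taken to be orthonormal, so that each Lasso at level $p$ reduces to coordinatewise soft-thresholding at $\lambda_p/2$ and $\gamma(\theta.\phi) = -2\sum_j \theta_j Y(\phi_j) + \sum_j \theta_j^2$. The function names and the list `y` (standing for the observations $Y(\phi_1), Y(\phi_2), \dots$) are hypothetical; the constants are those of (4.7).

```python
import math

def soft_threshold(y, t):
    """Shrink y towards zero by t; values inside [-t, t] are set to zero."""
    if y > t:
        return y - t
    if y < -t:
        return y + t
    return 0.0

def selected_lasso(y, eps):
    """Return (p_hat, theta_hat) over dyadic truncation levels p = 1, 2, 4, ...

    Assumes an orthonormal dictionary: y[j] stands for Y(phi_{j+1}).
    """
    if not y:
        raise ValueError("y must contain at least one observation")
    best = None
    p = 1
    while p <= len(y):
        lam = 4.0 * eps * (math.sqrt(math.log(p)) + 1.0)   # lambda_p from (4.7)
        pen = 5.0 * eps ** 2 * math.log(p)                 # pen(p) from (4.7)
        theta = [soft_threshold(v, lam / 2.0) for v in y[:p]]
        # gamma(h) = -2 Y(h) + ||h||^2 in the orthonormal case
        gamma = sum(-2.0 * t * v + t * t for t, v in zip(theta, y))
        crit = gamma + lam * sum(abs(t) for t in theta) + pen
        if best is None or crit < best[0]:
            best = (crit, p, theta)
        p *= 2
    return best[1], best[2]
```

In this sketch a signal carried by the fourth coordinate is only visible from level $p = 4$ on, and larger levels are discouraged both by the growing threshold $\lambda_p/2$ and by $\mathrm{pen}(p)$.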



Remark 4.3. Our primary motivation for introducing the selected Lasso estimator described above was to construct an estimator adapted from the Lasso and fitted to solve estimation problems involving infinite dictionaries. Nonetheless, such a selected Lasso estimator remains well-defined and can also be of interest for estimation in the case of finite dictionaries. Indeed, let $\mathcal{D}_{p_0}$ be a given finite dictionary of size $p_0$. Assume for simplicity that the cardinality of $\mathcal{D}_{p_0}$ is an integer power of two: $p_0 = 2^{J_0}$. Instead of working with the Lasso estimator defined by

$$\hat f_{p_0} = \operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D}_{p_0})} \gamma(h) + \lambda_{p_0}\|h\|_{\mathcal{L}_1(\mathcal{D}_{p_0})},$$

with $\lambda_{p_0} = 4\varepsilon\left(\sqrt{\ln p_0} + 1\right)$ being chosen from the lower bound of Theorem 3.2, one can introduce a sequence of dyadic truncated dictionaries $\mathcal{D}_1 \subset \dots \subset \mathcal{D}_p \subset \dots \subset \mathcal{D}_{p_0}$, and consider the associated selected Lasso estimator defined by

$$\hat f_{\hat p} = \operatorname*{argmin}_{p\in\Lambda_0}\Big\{\operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D}_p)}\Big[\gamma(h) + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\Big] + \mathrm{pen}(p)\Big\},$$
where $\Lambda_0 = \{2^J,\ J = 0,\dots,J_0\}$ and where the sequences $\lambda_p = 4\varepsilon\left(\sqrt{\ln p} + 1\right)$ and $\mathrm{pen}(p) = 5\varepsilon^2\ln p$ are chosen from Theorem 4.2. The estimator $\hat f_{\hat p}$ can be seen as an $\ell_0$-penalized estimator among the sequence of Lasso estimators $(\hat f_p)_{p\in\Lambda_0}$ associated to the truncated dictionaries $(\mathcal{D}_p)_{p\in\Lambda_0}$,

$$\hat f_p = \operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D}_p)} \gamma(h) + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}.$$

In particular, notice that the selected Lasso estimator $\hat f_{\hat p}$ and the Lasso estimator $\hat f_{p_0}$ coincide when $\hat p = p_0$ and that in any case the definition of $\hat f_{\hat p}$ guarantees that $\hat f_{\hat p}$ makes a better tradeoff between approximation, $\ell_1$-regularization and sparsity than $\hat f_{p_0}$. Furthermore, the risk bound (4.8) remains satisfied by $\hat f_{\hat p}$ for a finite dictionary $\mathcal{D}_{p_0}$ if we replace $\mathcal{D}$ by $\mathcal{D}_{p_0}$ and $\Lambda$ by $\Lambda_0$.

5 Rates of convergence of the Lasso and selected 
Lasso estimators 

In this section, our purpose is to provide rates of convergence of the Lasso and the selected Lasso estimators introduced in Section 3 and Section 4. Since in learning theory one has little or no a priori knowledge of the smoothness of the unknown target function $f$ in the Hilbert space $\mathbb{H}$, it is essential to aim at establishing performance bounds for a wide range of function classes. Here, we shall analyze rates of convergence whenever $f$ belongs to some real interpolation space between a subset of $\mathcal{L}_1(\mathcal{D})$ and the Hilbert space $\mathbb{H}$. This will provide a full range of rates of convergence related to the unknown smoothness of $f$. In particular, we shall prove that both the Lasso and the selected Lasso estimators perform as well as the greedy algorithms presented by Barron et al. in [1]. Furthermore, we shall check that the selected Lasso estimator is simultaneously approximately minimax when the dictionary is an orthonormal basis of $\mathbb{H}$ for a suitable signal to noise ratio.

Throughout the section, we keep the same framework as in Section 4.1. In particular, $\mathcal{D} = \{\phi_j\}_{j\in\mathbb{N}^*}$ shall be a given infinite countable ordered dictionary. We consider the sequence of truncated dictionaries $(\mathcal{D}_p)_{p\in\mathbb{N}^*}$ defined by (4.2), the associated sequence of Lasso estimators $(\hat f_p)_{p\in\mathbb{N}^*}$ defined by (4.3) and the selected Lasso estimator $\hat f_{\hat p}$ defined by (4.6) with $\lambda_p = 4\varepsilon(\sqrt{\ln p}+1)$ and $\mathrm{pen}(p) = 5\varepsilon^2\ln p$ and where $\Lambda$ still denotes the set of dyadic integers defined by (4.4).

The rates of convergence for the sequence of the Lasso and the selected Lasso estimators will be derived from the oracle inequalities established in Theorem 3.2 and Theorem 4.2 respectively. We know from Theorem 3.2 that, for all $p\in\mathbb{N}^*$, the quadratic risk of the Lasso estimator $\hat f_p$ is bounded by

$$\mathbb{E}\Big[\|f-\hat f_p\|^2\Big] \le C\left[\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \lambda_p\varepsilon\right], \tag{5.1}$$
where $C$ is an absolute positive constant, while we know from Theorem 4.2 that the quadratic risk of the selected Lasso estimator $\hat f_{\hat p}$ is bounded by

$$\mathbb{E}\Big[\|f-\hat f_{\hat p}\|^2\Big] \le C\left[\inf_{p\in\Lambda}\left(\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) + \mathrm{pen}(p)\right) + \varepsilon^2\right], \tag{5.2}$$

where $C$ is an absolute positive constant. Thus, to bound the quadratic risks of the estimators $\hat f_{\hat p}$ and $\hat f_p$ for all $p\in\mathbb{N}^*$, we can first focus on bounding, for all $p\in\mathbb{N}^*$,

$$\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) = \|f-f_p\|^2 + \lambda_p\|f_p\|_{\mathcal{L}_1(\mathcal{D}_p)}, \tag{5.3}$$

where we denote by $f_p$ the deterministic Lasso for the truncated dictionary $\mathcal{D}_p$ defined by

$$f_p = \operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right). \tag{5.4}$$

This first step will be handled in Section 5.1 by considering suitable interpolation spaces. Then in Section 5.2, we shall pass on the rates of convergence of the deterministic Lassos to the Lasso and the selected Lasso estimators thanks to the upper bounds (5.1) and (5.2). By looking at these upper bounds, we can expect the selected Lasso estimator to achieve much better rates of convergence than the Lasso estimators. Indeed, for a fixed value of $p\in\mathbb{N}^*$, we can see that the risk of the Lasso estimator $\hat f_p$ is roughly of the same order as the rate of convergence of the corresponding deterministic Lasso $f_p$, whereas the risk of $\hat f_{\hat p}$ is bounded by the infimum over all $p\in\Lambda$ of penalized rates of convergence of the deterministic Lassos $f_p$.

5.1 Interpolation spaces 

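Remember that we are first looking for an upper bound of $\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right)$ for all $p\in\mathbb{N}^*$. In fact, this quantity is linked to another one in approximation theory, the so-called $K_{\mathcal{D}}$-functional defined below. This link is specified in the following essential lemma.

Lemma 5.1. Let $D$ be some finite or infinite dictionary. For any $\lambda \ge 0$ and $\delta > 0$, consider

$$L_D(f,\lambda) := \inf_{h\in\mathcal{L}_1(D)}\left(\|f-h\|^2 + \lambda\|h\|_{\mathcal{L}_1(D)}\right)$$

and the $K_D$-functional defined by

$$K_D(f,\delta) := \inf_{h\in\mathcal{L}_1(D)}\left(\|f-h\| + \delta\|h\|_{\mathcal{L}_1(D)}\right). \tag{5.5}$$

Then,

$$\frac{1}{2}\inf_{\delta>0}\left(K_D^2(f,\delta) + \frac{\lambda^2}{2\delta^2}\right) \le L_D(f,\lambda) \le \inf_{\delta>0}\left(K_D^2(f,\delta) + \frac{\lambda^2}{4\delta^2}\right). \tag{5.6}$$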
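As a sanity check of the two-sided comparison between $L_D$ and the $K_D$-functional, the following Python sketch evaluates both sides in the simplest possible setting, chosen here only for illustration: a dictionary reduced to a single unit vector, so that $f$ is a scalar $b$ and both infima have closed forms. The constants $1/2$, $\lambda^2/(2\delta^2)$ and $\lambda^2/(4\delta^2)$ are those of (5.6) as read from the statement above; the grid over $\delta$ is a hypothetical numerical device, not part of the lemma.

```python
def lasso_value(b, lam):
    """L_D(f, lam) = min_t (b - t)^2 + lam * |t| for a one-atom dictionary."""
    t = max(abs(b) - lam / 2.0, 0.0)          # minimizer (soft-thresholding)
    return (abs(b) - t) ** 2 + lam * t

def k_functional(b, delta):
    """K_D(f, delta) = min_t |b - t| + delta * |t| = |b| * min(1, delta)."""
    return abs(b) * min(1.0, delta)

def sandwich(b, lam, n_grid=4001):
    """Return (lower, L, upper) with the infima over delta taken on a log grid."""
    deltas = [10.0 ** (-3.0 + 6.0 * i / (n_grid - 1)) for i in range(n_grid)]
    lower = 0.5 * min(k_functional(b, d) ** 2 + lam ** 2 / (2.0 * d ** 2)
                      for d in deltas)
    upper = min(k_functional(b, d) ** 2 + lam ** 2 / (4.0 * d ** 2)
                for d in deltas)
    return lower, lasso_value(b, lam), upper
```

Because the grid only approximates the infimum over $\delta > 0$, the computed lower and upper values are slightly larger than the exact ones; the check is therefore indicative rather than a proof.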



Let us now introduce a whole range of interpolation spaces $B_{q,r}$ that are intermediate spaces between subsets of $\mathcal{L}_1(\mathcal{D})$ and the Hilbert space $\mathbb{H}$ on which the $K_{\mathcal{D}_p}$-functionals (and thus the rates of convergence of the deterministic Lassos $f_p$) are controlled for all $p\in\mathbb{N}^*$.
Definition 5.2. [Spaces $\mathcal{L}_{1,r}$ and $B_{q,r}$] Let $R > 0$, $r > 0$, $1 < q < 2$ and $\alpha = 1/q - 1/2$.

We say that a function $g$ belongs to the space $\mathcal{L}_{1,r}$ if there exists $C > 0$ such that for all $p\in\mathbb{N}^*$, there exists $g_p\in\mathcal{L}_1(\mathcal{D}_p)$ such that

$$\|g_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le C$$

and

$$\|g-g_p\| \le C|\mathcal{D}_p|^{-r} = Cp^{-r}. \tag{5.7}$$

The smallest $C$ such that this holds defines a norm $\|g\|_{\mathcal{L}_{1,r}}$ on the space $\mathcal{L}_{1,r}$.

We say that $g$ belongs to $B_{q,r}(R)$ if, for all $\delta > 0$,

$$\inf_{h\in\mathcal{L}_{1,r}}\left(\|g-h\| + \delta\|h\|_{\mathcal{L}_{1,r}}\right) \le R\,\delta^{2\alpha}. \tag{5.8}$$

We say that $g\in B_{q,r}$ if there exists $R > 0$ such that $g\in B_{q,r}(R)$. In this case, the smallest $R$ such that $g\in B_{q,r}(R)$ defines a norm on the space $B_{q,r}$ and is denoted by $\|g\|_{B_{q,r}}$.

Remark 5.3. Note that the spaces $\mathcal{L}_{1,r}$ and $B_{q,r}$ depend on the choice of the whole dictionary $\mathcal{D}$ as well as on the way it is ordered, but we shall omit this dependence so as to lighten the notation. The set of spaces $\mathcal{L}_{1,r}$ can be seen as substitutes for the whole space $\mathcal{L}_1(\mathcal{D})$ that are adapted to the truncation of the dictionary. In particular, the spaces $\mathcal{L}_{1,r}$ are smaller than the space $\mathcal{L}_1(\mathcal{D})$ and the smaller the value of $r > 0$, the smaller the distinction between them. In fact, looking at (5.7), we can see that working with the spaces $\mathcal{L}_{1,r}$ rather than $\mathcal{L}_1(\mathcal{D})$ will enable us to have a certain amount of control (measured by the parameter $r$) as regards what happens beyond the levels of truncation.

Thanks to the property of the interpolation spaces $B_{q,r}$ and to the equivalence established in Lemma 5.1 between the rates of convergence of the deterministic Lassos and the $K_{\mathcal{D}}$-functional, we are now able to provide the following upper bound of the rates of convergence of the deterministic Lassos when the target function belongs to some interpolation space $B_{q,r}$.

Lemma 5.4. Let $1 < q < 2$, $r > 0$ and $R > 0$. Assume that $f\in B_{q,r}(R)$. Then, there exists $C_q > 0$ depending only on $q$ such that, for all $p\in\mathbb{N}^*$,

$$\inf_{h\in\mathcal{L}_1(\mathcal{D}_p)}\left(\|f-h\|^2 + \lambda_p\|h\|_{\mathcal{L}_1(\mathcal{D}_p)}\right) \le C_q\max\left(R^q\lambda_p^{2-q},\ \left(Rp^{-r}\right)^{\frac{2q}{2-q}}\lambda_p^{\frac{4(1-q)}{2-q}}\right). \tag{5.9}$$



Remark 5.5. [Orthonormal case] Let us point out that the abstract interpolation spaces $B_{q,r}$ are in fact natural extensions to non-orthonormal dictionaries of function spaces that are commonly studied in statistics to analyze the approximation performance of estimators in the orthonormal case, that is to say Besov spaces, strong-$\mathcal{L}_q$ spaces and weak-$\mathcal{L}_q$ spaces. More precisely, recall that if $\mathbb{H}$ denotes a Hilbert space and $\mathcal{D} = \{\phi_j\}_{j\in\mathbb{N}^*}$ is an orthonormal basis of $\mathbb{H}$, then, for all $r > 0$, $q > 0$ and $R > 0$, we say that $g = \sum_{j=1}^{\infty}\theta_j\phi_j$ belongs to the Besov space $B^r_{2,\infty}(R)$ if

$$\sup_{J\in\mathbb{N}^*}\left(J^{2r}\sum_{j=J}^{\infty}\theta_j^2\right) \le R^2, \tag{5.10}$$

while $g$ is said to belong to $\mathcal{L}_q(R)$ if

$$\sum_{j=1}^{\infty}|\theta_j|^q \le R^q, \tag{5.11}$$

and a slightly weaker condition is that $g$ belongs to $w\mathcal{L}_q(R)$, that is to say

$$\sup_{\eta>0}\left(\eta^q\sum_{j=1}^{\infty}\mathbb{1}_{\{|\theta_j|>\eta\}}\right) \le R^q. \tag{5.12}$$

Then, we prove in Section 8 that for all $1 < q < 2$ and $r > 0$, there exists $C_{q,r} > 0$ depending only on $q$ and $r$ such that the following inclusions of spaces hold for all $R > 0$ when $\mathcal{D}$ is an orthonormal basis of $\mathbb{H}$:

$$\mathcal{L}_q(R)\cap B^r_{2,\infty}(R) \subset w\mathcal{L}_q(R)\cap B^r_{2,\infty}(R) \subset B_{q,r}(C_{q,r}R). \tag{5.13}$$

In particular, these inclusions shall turn out to be useful to check the optimality of the rates of convergence of the selected Lasso estimator in Section 5.3.
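For a finitely supported coefficient sequence, the three radii appearing in (5.10) to (5.12) can be computed directly. The Python sketch below is only an illustration: the function names are hypothetical, the sequence is assumed to be zero beyond its last listed entry, and the weak radius uses the standard fact that $\sup_{\eta>0}\eta^q\,\#\{j : |\theta_j|>\eta\}$ equals $\max_k k\,|\theta|_{(k)}^q$, where $|\theta|_{(1)}\ge|\theta|_{(2)}\ge\dots$ is the decreasing rearrangement.

```python
import math

def strong_lq_radius(theta, q):
    """Smallest R with sum_j |theta_j|^q <= R^q, as in (5.11)."""
    return sum(abs(t) ** q for t in theta) ** (1.0 / q)

def weak_lq_radius(theta, q):
    """Smallest R with sup_eta eta^q * #{j: |theta_j| > eta} <= R^q, as in (5.12)."""
    decreasing = sorted((abs(t) for t in theta), reverse=True)
    best = max((k + 1) * v ** q for k, v in enumerate(decreasing))
    return best ** (1.0 / q)

def besov_radius(theta, r):
    """Smallest R with sup_J J^(2r) * sum_{j >= J} theta_j^2 <= R^2, as in (5.10)."""
    tail = 0.0
    best = 0.0
    for J in range(len(theta), 0, -1):        # accumulate tails from the end
        tail += theta[J - 1] ** 2
        best = max(best, J ** (2.0 * r) * tail)
    return math.sqrt(best)
```

The weak radius never exceeds the strong one, which is the finite-sequence counterpart of the first inclusion in (5.13).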



5.2 Upper bounds of the quadratic risk of the estimators 

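The rates of convergence of the deterministic Lassos $f_p$ given in Lemma 5.4 can now be passed on to the Lasso estimators $\hat f_p$, $p\in\mathbb{N}^*$, and to the selected Lasso estimator $\hat f_{\hat p}$ thanks to the oracle inequalities (5.1) and (5.2) respectively.

Proposition 5.6. Let $1 < q < 2$, $r > 0$ and $R > 0$. Assume that $f\in B_{q,r}(R)$. Then, there exists $C_q > 0$ depending only on $q$ such that, for all $p\in\mathbb{N}^*$,

• if $\left(\sqrt{\ln p}+1\right)^{\frac{q-1}{q}} \le R\varepsilon^{-1} \le p^{\frac{2r}{q}}\left(\sqrt{\ln p}+1\right)$, then

$$\mathbb{E}\Big[\|f-\hat f_p\|^2\Big] \le C_q R^q\left(\varepsilon\left(\sqrt{\ln p}+1\right)\right)^{2-q}, \tag{5.14}$$

• if $R\varepsilon^{-1} > p^{\frac{2r}{q}}\left(\sqrt{\ln p}+1\right)$, then

$$\mathbb{E}\Big[\|f-\hat f_p\|^2\Big] \le C_q\left(Rp^{-r}\right)^{\frac{2q}{2-q}}\left(\varepsilon\left(\sqrt{\ln p}+1\right)\right)^{\frac{4(1-q)}{2-q}}, \tag{5.15}$$

• if $R\varepsilon^{-1} < \left(\sqrt{\ln p}+1\right)^{\frac{q-1}{q}}$, then

$$\mathbb{E}\Big[\|f-\hat f_p\|^2\Big] \le C_q\,\varepsilon^2\left(\sqrt{\ln p}+1\right). \tag{5.16}$$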
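The three regimes can be tabulated numerically. The following Python sketch returns which of the three cases applies and the corresponding bound with the unknown constant $C_q$ set to one; the function name and the regime labels are hypothetical, and the thresholds and exponents are those of (5.14) to (5.16) as written above.

```python
import math

def lasso_rate(R, eps, p, q, r):
    """Return (regime, bound / C_q) for the Lasso at truncation level p."""
    if not (1.0 < q < 2.0 and r > 0.0 and R > 0.0 and eps > 0.0 and p >= 1):
        raise ValueError("need 1 < q < 2, r > 0, R > 0, eps > 0, p >= 1")
    s = math.sqrt(math.log(p)) + 1.0          # sqrt(ln p) + 1
    snr = R / eps                             # signal to noise ratio
    lower = s ** ((q - 1.0) / q)
    upper = p ** (2.0 * r / q) * s
    if snr < lower:
        return "noise-dominated", eps ** 2 * s                      # (5.16)
    if snr <= upper:
        return "optimal", R ** q * (eps * s) ** (2.0 - q)           # (5.14)
    bound = ((R * p ** (-r)) ** (2.0 * q / (2.0 - q))
             * (eps * s) ** (4.0 * (1.0 - q) / (2.0 - q)))          # (5.15)
    return "truncation-limited", bound
```

For instance, with $R = 1$, $\varepsilon = 0.01$, $q = 1.5$ and $r = 1$, the level $p = 1024$ falls in the first regime, whereas a very small dictionary with a small $r$ is limited by truncation.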



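Proposition 5.7. Let $1 < q < 2$ and $r > 0$. Assume that $f\in B_{q,r}(R)$ with $R > 0$ such that $R\varepsilon^{-1} \ge \max\left(e, (4r)^{-1}q\right)$.

Then, there exists $C_{q,r} > 0$ depending only on $q$ and $r$ such that the quadratic risk of $\hat f_{\hat p}$ satisfies

$$\mathbb{E}\Big[\|f-\hat f_{\hat p}\|^2\Big] \le C_{q,r}\,R^q\left(\varepsilon\sqrt{\ln(R\varepsilon^{-1})}\right)^{2-q}. \tag{5.17}$$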
Remark 5.8.

1. Notice that the assumption $R\varepsilon^{-1} \ge \max\left(e, (4r)^{-1}q\right)$ of Proposition 5.7 is not restrictive since it only means that we consider non-degenerate situations when the signal to noise ratio is large enough, which is the only interesting case to use the selected Lasso estimator. Indeed, if $R\varepsilon^{-1}$ is too small, then the estimator equal to zero will always be better than any non-zero estimator, in particular Lasso estimators.

2. Proposition 5.7 highlights the fact that the selected Lasso estimator can simultaneously achieve rates of convergence of order $\left(\varepsilon\sqrt{\ln(\|f\|_{B_{q,r}}\varepsilon^{-1})}\right)^{2-q}$ for all classes $B_{q,r}$ without knowing which class contains $f$. Besides, comparing the upper bound (5.17) to the lower bound (5.20) established in the next section for the minimax risk when the dictionary $\mathcal{D}$ is an orthonormal basis of $\mathbb{H}$ and $r < 1/q - 1/2$, we see that they can match up to a constant if the signal to noise ratio is large enough. This proves that the rate of convergence (5.17) achieved by $\hat f_{\hat p}$ is optimal.

3. Analyzing the different results of Proposition 5.6, we can notice that, unlike the selected Lasso estimator, the Lasso estimators are not adaptive. In particular, comparing (5.14) to (5.17), we see that the Lassos $\hat f_p$ are likely to achieve the optimal rate of convergence (5.17) only for $p$ large enough, more precisely $p$ such that $R\varepsilon^{-1} \le p^{2r/q}(\sqrt{\ln p}+1)$. For smaller values of $p$, truncating the dictionary at level $p$ affects the rate of convergence as shown in (5.15). The problem is that $q$ and $r$ are unknown since they are the parameters characterizing the smoothness of the unknown target function. Therefore, when one chooses a level $p$ of truncation of the dictionary, one does not know whether $R\varepsilon^{-1} \le p^{2r/q}(\sqrt{\ln p}+1)$ and thus whether the corresponding Lasso estimator $\hat f_p$ has a good rate of convergence. When working with the Lassos, the statistician is faced with a dilemma since one has to choose $p$ large enough to get an optimal rate of convergence, but the larger $p$, the less sparse and interpretable the model. The advantage of using the selected Lasso estimator rather than the Lassos is that, by construction of $\hat f_{\hat p}$, we are sure to get an estimator making the best tradeoff between approximation, $\ell_1$-regularization and sparsity and achieving desirable rates of convergence for any target function belonging to some interpolation space $B_{q,r}$.

4. Looking at the different results from (5.14) to (5.17), we can notice that the parameter $q$ has much more influence on the rates of convergence than the parameter $r$ since the order of the rates depends only on the parameter $q$ while the dependence on $r$ appears only in the multiplicative factor. Nonetheless, note that the smoother the target function with respect to the parameter $r$, the smaller the number of variables one needs to keep to get a good rate of convergence for the Lasso estimators. Indeed, on the one hand, it is easy to check that $B_{q,r}(R) \subset B_{q,r'}(R)$ for $r > r' > 0$, which means that the smoothness of $f$ increases with $r$, while on the other hand, $p^{2r/q}(\sqrt{\ln p}+1)$ increases with respect to $r$ so that the larger $r$, the smaller the $p$ satisfying the constraint necessary for the Lasso $\hat f_p$ to achieve the optimal rate of convergence (5.14).
5. Proposition 5.6 shows that the Lassos $\hat f_p$ perform as well as the greedy algorithms studied by Barron et al. in [1]. Indeed, in the case of the fixed design Gaussian regression model introduced in Example 2.2 with a sample of size $n$, we have $\varepsilon = \sigma/\sqrt{n}$ and (5.14) yields that the Lasso estimator $\hat f_p$ achieves a rate of convergence of order $R^q\left(n^{-1}\ln p\right)^{1-q/2}$ provided that $R\varepsilon^{-1}$ is well-chosen, which corresponds to the rate of convergence established by Barron et al. for the greedy algorithms. Similarly to our result, Barron et al. need to assume that the dictionary is large enough so as to ensure such rates of convergence. In fact, they consider truncated dictionaries of size $p$ greater than $n^{1/(2r)}$ with $n^{1/q-1/2} \ge \|f\|_{B_{q,r}}$. Under these assumptions, we recover the upper bound we impose on $R\varepsilon^{-1}$ to get the rate (5.14).

Remark 5.9. [Orthonormal case]

1. Notice that the rates of convergence provided for the Lasso estimators in Proposition 5.6 are a generalization to non-orthonormal dictionaries of the well-known performance bounds of soft-thresholding estimators in the orthonormal case. Indeed, when the dictionary $\mathcal{D} = \{\phi_j\}_j$ is an orthonormal basis of $\mathbb{H}$, if we set $\Theta_p := \{\theta = (\theta_j)_{j\in\mathbb{N}^*},\ \theta = (\theta_1,\dots,\theta_p,0,\dots,0,\dots)\}$ and calculate the subdifferential of the function $\theta\in\Theta_p \mapsto \gamma(\theta.\phi) + \lambda_p\|\theta\|_1$, where the function $\gamma$ is defined by (2.4), we easily get that $\hat f_p = \hat\theta_p.\phi$ with $\hat\theta_p = (\hat\theta_{p,1},\dots,\hat\theta_{p,p},0,\dots,0,\dots)$ where for all $j = 1,\dots,p$,

$$\hat\theta_{p,j} = \begin{cases} Y(\phi_j) - \lambda_p/2 & \text{if } Y(\phi_j) > \lambda_p/2 = 2\varepsilon\left(\sqrt{\ln p}+1\right),\\ Y(\phi_j) + \lambda_p/2 & \text{if } Y(\phi_j) < -\lambda_p/2 = -2\varepsilon\left(\sqrt{\ln p}+1\right),\\ 0 & \text{else,}\end{cases}$$

where $Y$ is defined by (2.1). Thus, the Lasso estimators $\hat f_p$ correspond to soft-thresholding estimators with thresholds of order $\varepsilon\sqrt{\ln p}$, and Proposition 5.6 together with the inclusions of spaces (5.13) enable us to recover the well-known rates of convergence of order $\left(\varepsilon\sqrt{\ln p}\right)^{2-q}$ for such thresholding estimators when the target function belongs to $w\mathcal{L}_q\cap B^r_{2,\infty}$ (see for instance [7] for the establishment of such rates of convergence for estimators based on wavelet thresholding in the white noise framework).

2. Let us stress that, in the orthonormal case, since the Lasso estimators $\hat f_p$ correspond to soft-thresholding estimators with thresholds of order $\varepsilon\sqrt{\ln p}$, the selected Lasso estimator $\hat f_{\hat p}$ can be viewed as a soft-thresholding estimator with adapted threshold $\varepsilon\sqrt{\ln\hat p}$.
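The closed form of the first item can be checked coordinate by coordinate: in the orthonormal case the criterion decouples into terms $-2\theta y + \theta^2 + \lambda|\theta|$ with $y = Y(\phi_j)$. The Python sketch below compares the soft-thresholded value with a brute-force search on a grid; the grid and the function names are hypothetical choices made for the illustration only.

```python
def lasso_coordinate(y, lam):
    """Soft-thresholding of y at level lam / 2 (closed-form coordinate Lasso)."""
    t = lam / 2.0
    if y > t:
        return y - t
    if y < -t:
        return y + t
    return 0.0

def coordinate_criterion(theta, y, lam):
    """One coordinate of gamma(theta.phi) + lam * ||theta||_1, orthonormal case."""
    return -2.0 * theta * y + theta ** 2 + lam * abs(theta)

def brute_force_minimum(y, lam, half_width=3.0, step=1e-3):
    """Smallest value of the coordinate criterion on a regular grid."""
    n = int(2 * half_width / step)
    return min(coordinate_criterion(-half_width + i * step, y, lam)
               for i in range(n + 1))
```

Since the soft-thresholded value is the exact minimizer of a convex function, its criterion value should never exceed any grid value.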

5.3 Lower bounds in the orthonormal case 

To complete our study on the rates of convergence, we propose to establish a lower bound of the minimax risk in the orthonormal case so as to prove that the selected Lasso estimator is simultaneously approximately minimax over spaces $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$ in the orthonormal case for a suitable signal to noise ratio $R\varepsilon^{-1}$.

Proposition 5.10. Assume that the dictionary $\mathcal{D}$ is an orthonormal basis of $\mathbb{H}$. Let $1 < q < 2$, $0 < r < 1/q - 1/2$ and $R > 0$ such that $R\varepsilon^{-1} \ge \max(e^2, u^2)$ where

$$u := \frac{1}{r} - q\left(\frac{1}{2r} + 1\right) > 0. \tag{5.18}$$

Then, there exists an absolute constant $\kappa > 0$ such that the minimax risk over $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$ satisfies

$$\inf_{\tilde f}\ \sup_{f\in\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)} \mathbb{E}\Big[\|f-\tilde f\|^2\Big] \ge \kappa\,u^{1-\frac{q}{2}}\,R^q\left(\varepsilon\sqrt{\ln(R\varepsilon^{-1})}\right)^{2-q}, \tag{5.19}$$

where the infimum is taken over all possible estimators $\tilde f$.
Remark 5.11.

1. Notice that the lower bound (5.19) depends much more on the parameter $q$ than on the parameter $r$, which only appears as a multiplicative factor through the term $u$. In fact, the assumption $f\in B^r_{2,\infty}(R)$ is just added to the assumption $f\in\mathcal{L}_q(R)$ in order to control the size of the high-level components of $f$ in the orthonormal basis $\mathcal{D}$ (see the proof of Lemma 8.5 to convince yourself), but this additional parameter of smoothness $r > 0$ can be taken arbitrarily small and has little effect on the minimax risk.

2. It turns out that the constraint $r < 1/q - 1/2$ of Proposition 5.10 is quite natural. Indeed, assume that $r > 1/q - 1/2$. Then, on the one hand it is easy to check that, for all $R > 0$, $B^r_{2,\infty}(R') \subset \mathcal{L}_q(R)$ with $R' = \left(1 - 2^{ru}\right)^{1/q}R$ where $u$ is defined by (5.18), and thus $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R') = B^r_{2,\infty}(R')$. On the other hand, noticing that $R' < R$, we have $B^r_{2,\infty}(R') \subset B^r_{2,\infty}(R)$ and thus $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R') \subset \mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$. Consequently, $B^r_{2,\infty}(R') \subset \mathcal{L}_q(R)\cap B^r_{2,\infty}(R) \subset B^r_{2,\infty}(R)$, and the intersection space $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$ is no longer a real intersection between a strong-$\mathcal{L}_q$ space and a Besov space $B^r_{2,\infty}$ but rather a Besov space $B^r_{2,\infty}$ itself. In this case, the lower bound of the minimax risk is known to be of order $\varepsilon^{4r/(2r+1)}$ (see [?] for instance), which is no longer of the form (5.19).

Now, we can straightforwardly deduce from (5.13) and Proposition 5.10 the following result which proves that the rate of convergence (5.17) achieved by the selected Lasso estimator is optimal.

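Proposition 5.12. Assume that the dictionary $\mathcal{D}$ is an orthonormal basis of $\mathbb{H}$. Let $1 < q < 2$, $0 < r < 1/q - 1/2$ and $R > 0$ such that $R\varepsilon^{-1} \ge \max(e^2, u^2)$ where $u$ is defined by (5.18).

Then, there exists $C_{q,r} > 0$ depending only on $q$ and $r$ such that the minimax risk over $B_{q,r}(R)$ satisfies

$$\inf_{\tilde f}\ \sup_{f\in B_{q,r}(R)} \mathbb{E}\Big[\|f-\tilde f\|^2\Big] \ge C_{q,r}\,R^q\left(\varepsilon\sqrt{\ln(R\varepsilon^{-1})}\right)^{2-q}, \tag{5.20}$$

where the infimum is taken over all possible estimators $\tilde f$.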
Remark 5.13. Looking at (5.13), one could have obtained a result similar to (5.20) by bounding from below the minimax risk over $w\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$ instead of $\mathcal{L}_q(R)\cap B^r_{2,\infty}(R)$ as is done in Proposition 5.10. We refer the interested reader to Theorem 1 in [17] for the establishment of such a result.

6 The Lasso for uncountable dictionaries: neural networks

In this section, we propose to provide some theoretical results on the performance of the Lasso when considering some particular infinite uncountable dictionaries such as those used for neural networks in the fixed design Gaussian regression models. Of course, there is no algorithm to approximate the Lasso solution for infinite dictionaries, so the following results are just to be seen as theoretical performance of the Lasso. We shall provide an $\ell_1$-oracle type inequality satisfied by the Lasso and deduce rates of convergence of this estimator whenever the target function belongs to some interpolation space between $\mathcal{L}_1(\mathcal{D})$ and the Hilbert space $\mathbb{H} = \mathbb{R}^n$. These results will again prove that the Lasso theoretically performs as well as the greedy algorithms introduced in [1].

In the artificial intelligence field, the introduction of artificial neural networks has been motivated by the desire to model the human brain by a computer. They have been applied successfully to pattern recognition (radar systems, face identification...), sequence recognition (gesture, speech...), image analysis and adaptive control, and their study can enable the construction of software agents (in computer and video games...) or autonomous robots, for instance. Artificial neural networks receive a number of input signals and produce an output signal. They consist of multiple layers of weighted-sum units, called neurons, which are of the type

$$\phi_{a,b} : \mathbb{R}^d \to \mathbb{R},\quad x \mapsto \chi\left(\langle a, x\rangle + b\right), \tag{6.1}$$

where $a\in\mathbb{R}^d$, $b\in\mathbb{R}$ and $\chi$ is the Heaviside function $\chi(x) = \mathbb{1}_{\{x>0\}}$ or more generally a sigmoid function. Here, we shall restrict to the case of $\chi$ being the Heaviside function. In other words, if we consider the infinite uncountable dictionary $\mathcal{D} = \{\phi_{a,b}\ ;\ a\in\mathbb{R}^d,\ b\in\mathbb{R}\}$, then a neural network is a real-valued function defined on $\mathbb{R}^d$ belonging to the linear span of $\mathcal{D}$.
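The dictionary elements (6.1) and a function in their linear span are easy to write down explicitly. The Python sketch below is purely illustrative: the function names are hypothetical, inputs are plain tuples, and the sum of absolute coefficients returned by `l1_of_representation` is only an upper bound on the $\mathcal{L}_1(\mathcal{D})$-norm of the network, since that norm is an infimum over all representations.

```python
def heaviside_unit(a, b):
    """Return the neuron x -> 1 if <a, x> + b > 0 else 0, as in (6.1)."""
    def phi(x):
        return 1.0 if sum(ai * xi for ai, xi in zip(a, x)) + b > 0 else 0.0
    return phi

def network(units, coeffs):
    """Finite linear combination of neurons: an element of the span of D."""
    def h(x):
        return sum(c * u(x) for c, u in zip(coeffs, units))
    return h

def l1_of_representation(coeffs):
    """Sum of |coefficients|: an upper bound on the L1(D)-norm of the network."""
    return sum(abs(c) for c in coeffs)
```

For example, `network([heaviside_unit((1.0, 0.0), -0.5), heaviside_unit((0.0, 1.0), 0.0)], [2.0, -1.0])` evaluates to `1.0` at the point `(1.0, 1.0)`.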

Let us now consider the fixed design Gaussian regression model introduced in Example 2.2 with neural network regression function estimators. Given a training sequence $\{(x_1,Y_1),\dots,(x_n,Y_n)\}$, we assume that $Y_i = f(x_i) + \sigma\xi_i$ for all $i = 1,\dots,n$ and we study the Lasso estimator over the set of neural network regression function estimators in $\mathcal{L}_1(\mathcal{D})$,

$$\hat f := \hat f(\lambda) = \operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D})} \|Y-h\|^2 + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}, \tag{6.2}$$

where $\lambda > 0$ is a regularization parameter, $\mathcal{L}_1(\mathcal{D})$ is the linear span of $\mathcal{D}$ equipped with the $\ell_1$-norm

$$\|h\|_{\mathcal{L}_1(\mathcal{D})} := \inf\Big\{\|\theta\|_1 = \sum_{a,b}|\theta_{a,b}|,\ h = \theta.\phi = \sum_{a,b}\theta_{a,b}\,\phi_{a,b}\Big\}, \tag{6.3}$$

and $\|Y-h\|^2 := \sum_{i=1}^{n}\left(Y_i - h(x_i)\right)^2/n$ is the empirical risk of $h$.
6.1 An $\ell_1$-oracle type inequality

Despite the fact that the dictionary $\mathcal{D}$ for neural networks is infinite uncountable, we are able to establish an $\ell_1$-oracle type inequality satisfied by the Lasso which is similar to the one provided in Theorem 3.2 in the case of a finite dictionary. This is due to the very particular structure of the dictionary $\mathcal{D}$, which is only composed of functions derived from the Heaviside function. This property enables us to achieve theoretical results without truncating the whole dictionary into finite subdictionaries, contrary to the study developed in Section 4 where we considered arbitrary infinite countable dictionaries (see Remark 8.3 for more details). The following $\ell_1$-oracle type inequality is once again a direct application of the general model selection Theorem 7.1 already used to prove both Theorem 3.2 and Theorem 4.2.



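Theorem 6.1. Assume that

$$\lambda \ge \frac{28\,\sigma}{\sqrt{n}}\left(\sqrt{\ln\left((n+1)^{d+1}\right)} + 4\right).$$

Consider the corresponding Lasso estimator $\hat f$ defined by (6.2). Then, there exists an absolute constant $C > 0$ such that

$$\mathbb{E}\Big[\|f-\hat f\|^2 + \lambda\|\hat f\|_{\mathcal{L}_1(\mathcal{D})}\Big] \le C\left[\inf_{h\in\mathcal{L}_1(\mathcal{D})}\left(\|f-h\|^2 + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}\right) + \frac{\lambda\,\sigma}{\sqrt{n}}\right].$$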



6.2 Rates of convergence in real interpolation spaces 

We can now deduce theoretical rates of convergence for the Lasso from Theorem 6.1. Since we do not truncate the dictionary $\mathcal{D}$, we shall not consider the spaces $\mathcal{L}_{1,r}$ and $B_{q,r}$ that we introduced in the last section because they were adapted to the truncation of the dictionary. Here, we can work with the whole space $\mathcal{L}_1(\mathcal{D})$ instead of $\mathcal{L}_{1,r}$ and the spaces $B_{q,r}$ will be replaced by bigger spaces $B_q$ that are the real interpolation spaces between $\mathcal{L}_1(\mathcal{D})$ and $\mathbb{H} = \mathbb{R}^n$.

Definition 6.2. [Space $B_q$] Let $1 < q < 2$, $\alpha = 1/q - 1/2$ and $R > 0$. We say that a function $g$ belongs to $B_q(R)$ if, for all $\delta > 0$,

$$\inf_{h\in\mathcal{L}_1(\mathcal{D})}\left(\|g-h\| + \delta\|h\|_{\mathcal{L}_1(\mathcal{D})}\right) \le R\,\delta^{2\alpha}. \tag{6.4}$$

We say that $g\in B_q$ if there exists $R > 0$ such that $g\in B_q(R)$. In this case, the smallest $R$ such that $g\in B_q(R)$ defines a norm on the space $B_q$ and is denoted by $\|g\|_{B_q}$.

The following proposition shows that the Lasso simultaneously achieves desirable levels of performance on all classes $B_q$ without knowing which class contains $f$.

Proposition 6.3. Let $1 < q < 2$. Assume that $f\in B_q(R)$ with $R \ge \sigma\left[\ln\left((n+1)^{d+1}\right)\right]^{\frac{q-1}{2q}}/\sqrt{n}$. Consider the Lasso estimator $\hat f$ defined by (6.2) with

$$\lambda = \frac{28\,\sigma}{\sqrt{n}}\left(\sqrt{\ln\left((n+1)^{d+1}\right)} + 4\right).$$

Then, there exists $C_q > 0$ depending only on $q$ such that the quadratic risk of $\hat f$ satisfies

$$\mathbb{E}\Big[\|f-\hat f\|^2\Big] \le C_q\,R^q\left[\frac{\ln\left((n+1)^{d+1}\right)}{n}\right]^{1-\frac{q}{2}}.$$

Remark 6.4. Notice that the above rates of convergence are of the same order as those provided for the Lasso in Proposition 5.6 for a suitable signal to noise ratio in the case of an infinite countable dictionary with $\varepsilon = \sigma/\sqrt{n}$. Besides, we recover the same rates of convergence as those obtained by Barron et al. in [1] for the greedy algorithms when considering neural networks. Notice that our results can be seen as the analog in the Gaussian framework of their results, which are valid under the assumption that the output variable $Y$ is bounded but not necessarily Gaussian.



7 A model selection theorem 

Let us end this paper by describing the main idea that has enabled us to establish 
all the oracle inequalities of Theorem 3.2, Theorem 4.2 and Theorem 6.1 as an 
application of a single general model selection theorem, and by presenting this 
general theorem. 

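We keep the notations introduced in Section 2. In particular, recall that one observes a process $(Y(h))_{h\in\mathbb{H}}$ defined by $Y(h) = \langle f,h\rangle + \varepsilon W(h)$ for all $h\in\mathbb{H}$, where $\varepsilon > 0$ is a fixed parameter and $W$ is an isonormal process, and that we define $\gamma(h) := -2Y(h) + \|h\|^2$.

The basic idea is to view the Lasso estimator as the solution of a penalized least squares model selection procedure over a properly defined countable collection of models with $\ell_1$-penalty. The key observation that enables one to make this connection is the simple fact that $\mathcal{L}_1(\mathcal{D}) = \bigcup_{R>0}\{h\in\mathcal{L}_1(\mathcal{D}),\ \|h\|_{\mathcal{L}_1(\mathcal{D})} \le R\}$, so that for any finite or infinite given dictionary $\mathcal{D}$, the Lasso $\hat f$ defined by

$$\hat f = \operatorname*{argmin}_{h\in\mathcal{L}_1(\mathcal{D})}\left(\gamma(h) + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}\right)$$

satisfies

$$\gamma(\hat f) + \lambda\|\hat f\|_{\mathcal{L}_1(\mathcal{D})} = \inf_{h\in\mathcal{L}_1(\mathcal{D})}\left(\gamma(h) + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}\right) = \inf_{R>0}\left(\inf_{\|h\|_{\mathcal{L}_1(\mathcal{D})}\le R}\gamma(h) + \lambda R\right).$$

Then, to obtain a countable collection of models, we just discretize the family of $\ell_1$-balls $\{h\in\mathcal{L}_1(\mathcal{D}),\ \|h\|_{\mathcal{L}_1(\mathcal{D})} \le R\}$ by setting for any integer $m \ge 1$,

$$S_m := \{h\in\mathcal{L}_1(\mathcal{D}),\ \|h\|_{\mathcal{L}_1(\mathcal{D})} \le m\varepsilon\},$$

and define $\hat m$ as the smallest integer such that $\hat f$ belongs to $S_{\hat m}$, i.e.

$$\hat m := \left\lceil \frac{\|\hat f\|_{\mathcal{L}_1(\mathcal{D})}}{\varepsilon} \right\rceil.$$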
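It is now easy to derive from the definitions of $\hat m$ and $\hat f$ and from the fact that $\mathcal{L}_1(\mathcal{D}) = \bigcup_{m\ge1}S_m$ that

$$\begin{aligned}
\gamma(\hat f) + \lambda\hat m\varepsilon &\le \gamma(\hat f) + \lambda\left(\|\hat f\|_{\mathcal{L}_1(\mathcal{D})} + \varepsilon\right)\\
&= \inf_{h\in\mathcal{L}_1(\mathcal{D})}\left(\gamma(h) + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}\right) + \lambda\varepsilon\\
&= \inf_{m\ge1}\ \inf_{h\in S_m}\left(\gamma(h) + \lambda\|h\|_{\mathcal{L}_1(\mathcal{D})}\right) + \lambda\varepsilon\\
&\le \inf_{m\ge1}\left(\inf_{h\in S_m}\gamma(h) + \lambda m\varepsilon\right) + \lambda\varepsilon,
\end{aligned}$$

that is to say

$$\gamma(\hat f) + \mathrm{pen}(\hat m) \le \inf_{m\ge1}\left(\inf_{h\in S_m}\gamma(h) + \mathrm{pen}(m)\right) + \rho \tag{7.2}$$

with $\mathrm{pen}(m) = \lambda m\varepsilon$ and $\rho = \lambda\varepsilon$. This means that $\hat f$ is equivalent to a $\rho$-approximate penalized least squares estimator over the sequence of models given by the collection of $\ell_1$-balls $\{S_m,\ m\ge1\}$. This property will enable us to derive $\ell_1$-oracle type inequalities by applying a general model selection theorem that guarantees such inequalities provided that the penalty $\mathrm{pen}(m)$ is large enough. This general theorem, stated below as Theorem 7.1, is borrowed from [5] and is a restricted version of an even more general model selection theorem that the interested reader can find in [15], Theorem 4.18. For the sake of completeness, the proof of Theorem 7.1 is recalled in Section 8.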
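The discretization step can be mirrored in a few lines of Python. The sketch below computes the index $\hat m$ of the smallest $\ell_1$-ball of radius $m\varepsilon$ containing an estimator with a given $\ell_1$-norm, together with the penalty $\mathrm{pen}(m) = \lambda m\varepsilon$ and the slack $\rho = \lambda\varepsilon$; the function names are hypothetical, and the `max(1, ...)` guard simply enforces $m \ge 1$ when the norm is zero.

```python
import math

def smallest_ball_index(l1_norm, eps):
    """Smallest integer m >= 1 with l1_norm <= m * eps (up to float rounding)."""
    if l1_norm < 0 or eps <= 0:
        raise ValueError("need l1_norm >= 0 and eps > 0")
    return max(1, math.ceil(l1_norm / eps))

def ball_penalty(m, lam, eps):
    """pen(m) = lam * m * eps, the penalty attached to the ball S_m."""
    return lam * m * eps

def approximation_slack(lam, eps):
    """rho = lam * eps, the price paid for discretizing the radii."""
    return lam * eps
```

By construction $\hat m\varepsilon \le \|\hat f\|_{\mathcal{L}_1(\mathcal{D})} + \varepsilon$, which is exactly the first inequality of the chain leading to (7.2).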

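Theorem 7.1. Let $\{S_m\}_{m\in\mathcal{M}}$ be a countable collection of convex and compact subsets of a Hilbert space $\mathbb{H}$. Define, for any $m\in\mathcal{M}$,

$$\Delta_m := \mathbb{E}\left[\sup_{h\in S_m}W(h)\right], \tag{7.3}$$

and consider weights $\{x_m\}_{m\in\mathcal{M}}$ such that

$$\Sigma := \sum_{m\in\mathcal{M}} e^{-x_m} < \infty.$$

Let $K > 1$ and assume that, for any $m\in\mathcal{M}$,

$$\mathrm{pen}(m) \ge 2K\varepsilon\left(\Delta_m + \varepsilon x_m + \sqrt{\Delta_m\varepsilon x_m}\right). \tag{7.4}$$

Given non negative $\rho_m$, $m\in\mathcal{M}$, define a $\rho_m$-approximate penalized least squares estimator as any $\hat f\in S_{\hat m}$, $\hat m\in\mathcal{M}$, such that

$$\gamma(\hat f) + \mathrm{pen}(\hat m) \le \inf_{m\in\mathcal{M}}\left(\inf_{h\in S_m}\gamma(h) + \mathrm{pen}(m) + \rho_m\right).$$

Then, there is a positive constant $C(K)$ such that for all $f\in\mathbb{H}$ and $z > 0$, with probability larger than $1 - \Sigma e^{-z}$,

$$\|f-\hat f\|^2 + \mathrm{pen}(\hat m) \le C(K)\left[\inf_{m\in\mathcal{M}}\left(\inf_{h\in S_m}\|f-h\|^2 + \mathrm{pen}(m) + \rho_m\right) + (1+z)\varepsilon^2\right]. \tag{7.5}$$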
Integrating this inequality with respect to $z$ leads to the following risk bound

$$\mathbb{E}\Big[\|f-\hat f\|^2 + \mathrm{pen}(\hat m)\Big] \le C(K)\left[\inf_{m\in\mathcal{M}}\left(\inf_{h\in S_m}\|f-h\|^2 + \mathrm{pen}(m) + \rho_m\right) + (1+\Sigma)\varepsilon^2\right]. \tag{7.6}$$



8 Proofs 

8.1 Oracle inequalities 

We first prove the general model selection Theorem 7.1. Its proof is based on the concentration inequality for the suprema of Gaussian processes established in [•5]. Then, deriving Theorem 3.2, Theorem 4.2 and Theorem 6.1 from Theorem 7.1 is an exercise. Indeed, using the key observation that the Lasso and the selected Lasso estimators are approximate penalized least squares estimators over a collection of $\ell_1$-balls with a convenient penalty, it only remains to determine a lower bound on this penalty to guarantee condition (7.4) and then to apply the conclusion of Theorem 7.1.

8.1.1 Proof of Theorem 7.1

Let $m\in\mathcal{M}$. Since $S_m$ is assumed to be a convex and compact subset, we can consider $f_m$ the projection of $f$ onto $S_m$, that is the unique element of $S_m$ such that $\|f-f_m\| = \inf_{h\in S_m}\|f-h\|$. By definition of $\hat f$, we have

$$\gamma(\hat f) + \mathrm{pen}(\hat m) \le \gamma(f_m) + \mathrm{pen}(m) + \rho_m.$$

Since $\|f\|^2 + \gamma(h) = \|f-h\|^2 - 2\varepsilon W(h)$, this implies that

$$\|f-\hat f\|^2 + \mathrm{pen}(\hat m) \le \|f-f_m\|^2 + 2\varepsilon\left(W(\hat f) - W(f_m)\right) + \mathrm{pen}(m) + \rho_m. \tag{8.1}$$

For all $m'\in\mathcal{M}$, let $y_{m'}$ be a positive number whose value will be specified below and define for every $h\in S_{m'}$

$$2w_{m'}(h) = \left(\|f-f_m\| + \|f-h\|\right)^2 + y_{m'}^2. \tag{8.2}$$

Finally, set

$$V_{m'} = \sup_{h\in S_{m'}}\left(\frac{W(h) - W(f_m)}{w_{m'}(h)}\right).$$

Taking these definitions into account, we get from (8.1) that

$$\|f-\hat f\|^2 + \mathrm{pen}(\hat m) \le \|f-f_m\|^2 + 2\varepsilon\,w_{\hat m}(\hat f)\,V_{\hat m} + \mathrm{pen}(m) + \rho_m. \tag{8.3}$$

The essence of the proof is the control of the random variables Vm' for all 
possible values of m' . To this end, we may use the concentration inequality for 
the suprema of Gaussian processes (see [•")]) which ensures that, given z > 0, for 
all m' e M, 



Vm' > E [Vm'] + ^2Vm'{Xm'+z) < e-^'^'+^\ 



(8.4) 
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where 



sup Var 

hes„,, 



Wjh) - Wif„ 



\\h-U\ 

s'^P — mr 



h£S. 



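From (8.2), $w_{m'}(h) \ge \left( \|f - f_m\| + \|f - h\| \right) y_{m'} \ge \|h - f_m\|\, y_{m'}$, so $v_{m'} \le y_{m'}^{-2}$, and summing the inequalities (8.4) over $m' \in \mathcal{M}$, we get that for every $z > 0$ there is an event $\Omega_z$ with $\mathbb{P}(\Omega_z) \ge 1 - \Sigma e^{-z}$ such that on $\Omega_z$, for all $m' \in \mathcal{M}$,

$$V_{m'} \le \mathbb{E}[V_{m'}] + y_{m'}^{-1} \sqrt{2 (x_{m'} + z)}. \tag{8.5}$$

Let us now bound $\mathbb{E}[V_{m'}]$. We may write

$$\mathbb{E}[V_{m'}] \le \mathbb{E}\left[ \frac{\sup_{h \in S_{m'}} \left( W(h) - W(f_{m'}) \right)}{\inf_{h \in S_{m'}} w_{m'}(h)} \right] + \mathbb{E}\left[ \frac{\left( W(f_{m'}) - W(f_m) \right)_+}{\inf_{h \in S_{m'}} w_{m'}(h)} \right]. \tag{8.6}$$

But from the definition of $f_{m'}$, we have for all $h \in S_{m'}$

$$2 w_{m'}(h) \ge \left( \|f - f_m\| + \|f - f_{m'}\| \right)^2 + y_{m'}^2 \ge \|f_{m'} - f_m\|^2 + y_{m'}^2 \ge \left( y_{m'}^2 \vee 2 y_{m'} \|f_{m'} - f_m\| \right).$$

Hence, on the one hand via (7.3) and recalling that $W$ is centered, we get

$$\mathbb{E}\left[ \frac{\sup_{h \in S_{m'}} \left( W(h) - W(f_{m'}) \right)}{\inf_{h \in S_{m'}} w_{m'}(h)} \right] \le 2 y_{m'}^{-2}\, \mathbb{E}\left[ \sup_{h \in S_{m'}} \left( W(h) - W(f_{m'}) \right) \right] \le 2 y_{m'}^{-2} \Delta_{m'},$$

and on the other hand, using the fact that $\left( W(f_{m'}) - W(f_m) \right) / \|f_m - f_{m'}\|$ is a standard normal variable, we get

$$\mathbb{E}\left[ \frac{\left( W(f_{m'}) - W(f_m) \right)_+}{\inf_{h \in S_{m'}} w_{m'}(h)} \right] \le y_{m'}^{-1}\, \mathbb{E}\left[ \left( \frac{W(f_{m'}) - W(f_m)}{\|f_m - f_{m'}\|} \right)_+ \right] \le y_{m'}^{-1} (2\pi)^{-1/2}.$$

Collecting these inequalities, we get from (8.6) that for all $m' \in \mathcal{M}$,

$$\mathbb{E}[V_{m'}] \le 2 \Delta_{m'} y_{m'}^{-2} + (2\pi)^{-1/2} y_{m'}^{-1}.$$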

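Hence, setting $\delta = \left( (4\pi)^{-1/2} + \sqrt{z} \right)^2$, (8.5) implies that on the event $\Omega_z$, for all $m' \in \mathcal{M}$,

$$V_{m'} \le y_{m'}^{-1} \left[ 2 \Delta_{m'} y_{m'}^{-1} + \sqrt{2 x_{m'}} + (2\pi)^{-1/2} + \sqrt{2z} \right] = y_{m'}^{-1} \left[ 2 \Delta_{m'} y_{m'}^{-1} + \sqrt{2 x_{m'}} + \sqrt{2\delta} \right]. \tag{8.7}$$

Given $K' \in \left( 1, \sqrt{K} \right]$ to be chosen later, we now define

$$y_{m'}^2 = 2 K'^2 \varepsilon^2 \left[ \left( \sqrt{x_{m'}} + \sqrt{\delta} \right)^2 + K'^{-1} \varepsilon^{-1} \Delta_{m'} + \sqrt{K'^{-1} \varepsilon^{-1} \Delta_{m'}} \left( \sqrt{x_{m'}} + \sqrt{\delta} \right) \right].$$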
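The claim that this choice of $y_{m'}$ forces $\varepsilon V_{m'} \le K'^{-1}$ through (8.7) can be probed numerically. The sketch below is only a sanity check under an assumption: it reads the definition as $y^2 = 2K'^2\varepsilon^2[(\sqrt{x}+\sqrt{\delta})^2 + K'^{-1}\varepsilon^{-1}\Delta + \sqrt{K'^{-1}\varepsilon^{-1}\Delta}(\sqrt{x}+\sqrt{\delta})]$, and the function names are illustrative, not the authors'.

```python
import math
import random


def y_squared(eps, k1, delta_m, x_m, dlt):
    """y^2 as read from the display above (an assumed reading of the source)."""
    a = math.sqrt(x_m) + math.sqrt(dlt)
    b = delta_m / (k1 * eps)
    return 2.0 * k1 ** 2 * eps ** 2 * (a * a + b + math.sqrt(b) * a)


def eps_times_rhs_87(eps, k1, delta_m, x_m, dlt):
    """eps times the right-hand side of (8.7) for the y chosen above."""
    y = math.sqrt(y_squared(eps, k1, delta_m, x_m, dlt))
    return (eps / y) * (2.0 * delta_m / y
                        + math.sqrt(2.0 * x_m) + math.sqrt(2.0 * dlt))


def check_bound(trials=10000, seed=0):
    """Return True if eps * RHS(8.7) <= 1/K' on randomly drawn parameters."""
    rng = random.Random(seed)
    for _ in range(trials):
        eps = rng.uniform(0.01, 5.0)
        k1 = rng.uniform(1.001, 3.0)      # plays the role of K' > 1
        delta_m = rng.uniform(0.0, 50.0)  # Delta_{m'}
        x_m = rng.uniform(0.0, 20.0)      # weight x_{m'}
        dlt = rng.uniform(1e-6, 20.0)     # delta
        if eps_times_rhs_87(eps, k1, delta_m, x_m, dlt) > 1.0 / k1 + 1e-12:
            return False
    return True
```

Calling `check_bound()` draws random positive parameters and verifies the inequality on each draw; it illustrates the algebra and is no substitute for the proof.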
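With this choice of $y_{m'}$, it is not hard to check that (8.7) warrants that on the event $\Omega_z$, $\varepsilon V_{m'} \le K'^{-1}$ for all $m' \in \mathcal{M}$, which in particular implies that $\varepsilon V_{\hat m} \le K'^{-1}$, and we get from (8.3) and (8.2) that

$$\|f - \hat f\|^2 + \mathrm{pen}(\hat m) \le \|f - f_m\|^2 + 2 K'^{-1} w_{\hat m}(\hat f) + \mathrm{pen}(m) + \rho_m = \|f - f_m\|^2 + K'^{-1} \left[ \left( \|f - f_m\| + \|f - \hat f\| \right)^2 + y_{\hat m}^2 \right] + \mathrm{pen}(m) + \rho_m.$$

Moreover, using repeatedly the elementary inequalities $(a + b)^2 \le (1 + \theta) a^2 + (1 + \theta^{-1}) b^2$ or equivalently $2ab \le \theta a^2 + \theta^{-1} b^2$ for various values of $\theta > 0$, we derive that on the one hand

$$K'^{-1} \left( \|f - f_m\| + \|f - \hat f\| \right)^2 \le \frac{1}{\sqrt{K'}} \left( \|f - \hat f\|^2 + \frac{\|f - f_m\|^2}{\sqrt{K'} - 1} \right)$$

and on the other hand

$$K'^{-1} y_{\hat m}^2 \le 2 K'^2 \varepsilon^2 \left[ \varepsilon^{-1} \Delta_{\hat m} + x_{\hat m} + \sqrt{\varepsilon^{-1} \Delta_{\hat m} x_{\hat m}} + B(K') \left( \frac{1}{2\pi} + 2z \right) \right],$$

where $B(K') = (K' - 1)^{-1} + \left( 4K'(K'^2 - 1) \right)^{-1}$.

Hence, setting $A(K') = 1 + K'^{-1/2} \left( \sqrt{K'} - 1 \right)^{-1}$, we deduce from the three preceding displays that on the event $\Omega_z$,

$$\|f - \hat f\|^2 + \mathrm{pen}(\hat m) \le A(K') \|f - f_m\|^2 + K'^{-1/2} \|f - \hat f\|^2 + 2 K'^2 \varepsilon \left[ \Delta_{\hat m} + \varepsilon x_{\hat m} + \sqrt{\Delta_{\hat m} \varepsilon x_{\hat m}} \right] + \mathrm{pen}(m) + \rho_m + 2 \varepsilon^2 K'^2 B(K') \left( \frac{1}{2\pi} + 2z \right),$$

or equivalently

$$\left( 1 - K'^{-1/2} \right) \|f - \hat f\|^2 + \mathrm{pen}(\hat m) - 2 K'^2 \varepsilon \left[ \Delta_{\hat m} + \varepsilon x_{\hat m} + \sqrt{\Delta_{\hat m} \varepsilon x_{\hat m}} \right] \le A(K') \|f - f_m\|^2 + \mathrm{pen}(m) + \rho_m + 2 \varepsilon^2 K'^2 B(K') \left( \frac{1}{2\pi} + 2z \right).$$

Because of condition (7.4) on the penalty function, this implies that

$$\left( 1 - K'^{-1/2} \right) \|f - \hat f\|^2 + \left( 1 - K'^2 K^{-1} \right) \mathrm{pen}(\hat m) \le A(K') \|f - f_m\|^2 + \mathrm{pen}(m) + \rho_m + 2 \varepsilon^2 K'^2 B(K') \left( \frac{1}{2\pi} + 2z \right).$$

Now choosing $K' = K^{2/5}$, so that $K'^{-1/2} = K'^2 K^{-1} = K^{-1/5}$, we get that

$$\left( 1 - K^{-1/5} \right) \left( \|f - \hat f\|^2 + \mathrm{pen}(\hat m) \right) \le A(K^{2/5}) \|f - f_m\|^2 + \mathrm{pen}(m) + \rho_m + 2 \varepsilon^2 K^{4/5} B(K^{2/5}) \left( \frac{1}{2\pi} + 2z \right).$$

So, there exists a positive constant $C := C(K)$ depending only on $K$ such that for all $z > 0$, on the event $\Omega_z$,

$$\|f - \hat f\|^2 + \mathrm{pen}(\hat m) \le C \left[ \inf_{m \in \mathcal{M}} \left( \|f - f_m\|^2 + \mathrm{pen}(m) + \rho_m \right) + \varepsilon^2 (1 + z) \right],$$

which proves (7.5). Integrating this inequality with respect to $z$ straightforwardly leads to the risk bound (7.6). □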

8.1.2 Proof of Theorem 3.2 

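Fix $p \in \mathbb{N}^*$. Let $\mathcal{M} = \mathbb{N}^*$ and consider the collection of $\ell_1$-balls for $m \in \mathcal{M}$,

$$S_m = \left\{ h \in \mathcal{L}_1(\mathcal{D}_p),\ \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon \right\}.$$

We have noticed at (7.2) that the Lasso estimator $\hat f_p$ is a $\rho$-approximate penalized least squares estimator over the sequence $\{S_m, m \ge 1\}$ for $\mathrm{pen}(m) = \lambda_p m \varepsilon$ and $\rho = \lambda_p \varepsilon$. So, it only remains to determine a lower bound on $\lambda_p$ that guarantees that $\mathrm{pen}(m)$ satisfies condition (7.4).

Let $h \in S_m$ and consider $\theta = (\theta_1, \dots, \theta_p)$ such that $h = \theta.\phi = \sum_{j=1}^p \theta_j \phi_j$ and $\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} = \|\theta\|_1$. The linearity of $W$ implies that

$$W(h) = \sum_{j=1}^p \theta_j W(\phi_j) \le \sum_{j=1}^p |\theta_j|\, |W(\phi_j)| \le m\varepsilon \max_{j=1,\dots,p} |W(\phi_j)|. \tag{8.9}$$

From Definition 2.1, $\mathrm{Var}[W(\phi_j)] = \mathbb{E}[W^2(\phi_j)] = \|\phi_j\|^2 \le 1$ for all $j = 1, \dots, p$. So, the variables $W(\phi_j)$ and $(-W(\phi_j))$, $j = 1, \dots, p$, are $2p$ centered normal variables with variance less than 1 and thus (see Lemma 2.3 in [15] for instance),

$$\mathbb{E}\left[ \max_{j=1,\dots,p} |W(\phi_j)| \right] = \mathbb{E}\left[ \left( \max_{j=1,\dots,p} W(\phi_j) \right) \vee \left( \max_{j=1,\dots,p} (-W(\phi_j)) \right) \right] \le \sqrt{2 \ln(2p)}.$$

Therefore, we deduce from (8.9) that

$$\Delta_m := \mathbb{E}\left[ \sup_{h \in S_m} W(h) \right] \le m\varepsilon \sqrt{2 \ln(2p)} \le \sqrt{2}\, m\varepsilon \left( \sqrt{\ln p} + \sqrt{\ln 2} \right). \tag{8.10}$$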
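The maximal inequality behind (8.10) can be illustrated by simulation. The sketch below is a rough Monte Carlo check, assuming the $W(\phi_j)$ are independent standard normal variables (the bound itself needs neither independence nor unit variance); the helper names are hypothetical.

```python
import math
import random


def mean_max_abs_gaussian(p, trials=2000, seed=0):
    """Monte Carlo estimate of E[max_j |Z_j|] for p i.i.d. N(0,1) variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(p))
    return total / trials


def maximal_bound(p):
    """Upper bound sqrt(2 ln(2p)) used for the 2p variables +/- W(phi_j)."""
    return math.sqrt(2.0 * math.log(2 * p))


def split_bound(p):
    """Looser bound sqrt(2) (sqrt(ln p) + sqrt(ln 2)) appearing in (8.10)."""
    return math.sqrt(2.0) * (math.sqrt(math.log(p)) + math.sqrt(math.log(2.0)))
```

For instance, comparing `mean_max_abs_gaussian(50)` with `maximal_bound(50)` and `split_bound(50)` shows the simulated mean sitting below both bounds.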



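Now, choose the weights of the form $x_m = \gamma m$ where $\gamma > 0$ is specified below. Then, $\sum_{m \ge 1} e^{-x_m} = 1 / (e^{\gamma} - 1) := \Sigma_\gamma < +\infty$.

Defining $K = 4\sqrt{2}/5 > 1$ and $\gamma = (1 - \sqrt{\ln 2})/K$, and using the inequality $2\sqrt{ab} \le \eta a + \eta^{-1} b$ with $\eta = 1/2$, we get that

$$2K\varepsilon \left( \Delta_m + \varepsilon x_m + \sqrt{\Delta_m \varepsilon x_m} \right) \le K\varepsilon \left( \frac{5}{2} \Delta_m + 4 x_m \varepsilon \right) \le 4 m \varepsilon^2 \left( \sqrt{\ln p} + \sqrt{\ln 2} + K\gamma \right) \le 4 m \varepsilon^2 \left( \sqrt{\ln p} + 1 \right) \le \lambda_p m \varepsilon$$

as soon as

$$\lambda_p \ge 4\varepsilon \left( \sqrt{\ln p} + 1 \right). \tag{8.11}$$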
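The constants $K = 4\sqrt{2}/5$ and $\gamma = (1 - \sqrt{\ln 2})/K$ are tuned so that the chain ending at (8.11) closes exactly. A minimal numerical sketch of that chain follows; it plugs the upper bound (8.10) in place of $\Delta_m$ (legitimate because the left-hand side of (7.4) is increasing in $\Delta_m$) and takes $\lambda_p$ equal to its smallest admissible value. The function names are illustrative.

```python
import math

K = 4.0 * math.sqrt(2.0) / 5.0
GAMMA = (1.0 - math.sqrt(math.log(2.0))) / K


def delta_upper(m, p, eps):
    """Upper bound (8.10) on Delta_m."""
    return math.sqrt(2.0) * m * eps * (
        math.sqrt(math.log(p)) + math.sqrt(math.log(2.0)))


def lhs_condition(m, p, eps):
    """2 K eps (Delta_m + eps x_m + sqrt(Delta_m eps x_m)) with x_m = gamma m."""
    d = delta_upper(m, p, eps)
    x = GAMMA * m
    return 2.0 * K * eps * (d + eps * x + math.sqrt(d * eps * x))


def penalty(m, p, eps):
    """pen(m) = lambda_p m eps with lambda_p = 4 eps (sqrt(ln p) + 1)."""
    lam = 4.0 * eps * (math.sqrt(math.log(p)) + 1.0)
    return lam * m * eps
```

Evaluating `lhs_condition(m, p, eps) <= penalty(m, p, eps)` over a grid of `m`, `p` and `eps` reproduces condition (7.4) for this penalty.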



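For such values of $\lambda_p$, condition (7.4) on the penalty function is satisfied and we may apply Theorem 7.1. Taking into account the definition of $\hat m$ at (7.1) and noticing that $\varepsilon^2 \le \lambda_p \varepsilon / 4$ for $\lambda_p$ satisfying (8.11), we get from (7.5) that there exists some $C > 0$ such that for all $z > 0$, with probability larger than $1 - \Sigma_\gamma e^{-z} \ge 1 - 3.4\, e^{-z}$,

$$\|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) + \lambda_p \varepsilon + (1 + z) \varepsilon^2 \right] \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) + \lambda_p \varepsilon (1 + z) \right], \tag{8.12}$$

while the risk bound (7.6) leads to

$$\mathbb{E}\left[ \|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \right] \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) + \lambda_p \varepsilon + (1 + \Sigma_\gamma) \varepsilon^2 \right] \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) + \lambda_p \varepsilon \right]. \tag{8.13}$$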



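Finally, to get the desired bounds (3.5) and (3.6), just notice that for all $R > 0$, by considering $m_R = \lceil R/\varepsilon \rceil \in \mathbb{N}^*$, we have for all $g \in \mathcal{L}_1(\mathcal{D}_p)$ such that $\|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R$,

$$\inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) \le \|f - g\|^2 + \lambda_p m_R \varepsilon \le \|f - g\|^2 + \lambda_p R + \lambda_p \varepsilon,$$

so that

$$\inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon} \|f - h\|^2 + \lambda_p m \varepsilon \right) \le \inf_{R > 0} \left( \inf_{\|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R} \|f - g\|^2 + \lambda_p R \right) + \lambda_p \varepsilon = \inf_{g \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - g\|^2 + \lambda_p \|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) + \lambda_p \varepsilon, \tag{8.14}$$

and combining (8.14) with (8.12) and (8.13) leads to

$$\|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le C \left[ \inf_{g \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - g\|^2 + \lambda_p \|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) + \lambda_p \varepsilon (1 + z) \right]$$

and

$$\mathbb{E}\left[ \|f - \hat f_p\|^2 + \lambda_p \|\hat f_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \right] \le C \left[ \inf_{g \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - g\|^2 + \lambda_p \|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) + \lambda_p \varepsilon \right],$$

where $C > 0$ is some absolute constant. □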
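8.1.3 Proof of Theorem 4.2

Let $\mathcal{M} = \mathbb{N}^* \times \Lambda$ and consider the set of $\ell_1$-balls for all $(m, p) \in \mathcal{M}$,

$$S_{m,p} = \left\{ h \in \mathcal{L}_1(\mathcal{D}_p),\ \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \le m\varepsilon \right\}.$$

Define $\hat m$ as the smallest integer such that $\hat f_{\hat p}$ belongs to $S_{\hat m, \hat p}$, i.e.

$$\hat m = \left\lceil \frac{\|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})}}{\varepsilon} \right\rceil. \tag{8.15}$$

Let $a > 0$ be a constant to be chosen later. From the definitions of $\hat m$, $\lambda_{\hat p}$ and $\mathrm{pen}(\hat p)$, and using the fact that for all $p \in \Lambda$, $\sqrt{\ln p} \le (\ln p)/\sqrt{\ln 2}$, we have

$$\begin{aligned}
\gamma(\hat f_{\hat p}) + \lambda_{\hat p} \hat m \varepsilon + a\, \mathrm{pen}(\hat p)
&\le \gamma(\hat f_{\hat p}) + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \lambda_{\hat p} \varepsilon + a\, \mathrm{pen}(\hat p) \\
&\le \gamma(\hat f_{\hat p}) + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + 4\varepsilon^2 \sqrt{\ln \hat p} + 4\varepsilon^2 + 5a\varepsilon^2 \ln \hat p \\
&\le \gamma(\hat f_{\hat p}) + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \left( \frac{4}{5\sqrt{\ln 2}} + a \right) 5\varepsilon^2 \ln \hat p + 4\varepsilon^2 \\
&\le \gamma(\hat f_{\hat p}) + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \left( \frac{4}{5\sqrt{\ln 2}} + a \right) \mathrm{pen}(\hat p) + 4\varepsilon^2.
\end{aligned}$$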



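Now, if we choose $a = 1 - 4/(5\sqrt{\ln 2}) \in\, ]0, 1[$, we get from the definition of $\hat f_{\hat p}$ and the fact that $\mathcal{L}_1(\mathcal{D}_p) = \bigcup_{m \in \mathbb{N}^*} S_{m,p}$ that

$$\begin{aligned}
\gamma(\hat f_{\hat p}) + \lambda_{\hat p} \hat m \varepsilon + a\, \mathrm{pen}(\hat p)
&\le \gamma(\hat f_{\hat p}) + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \mathrm{pen}(\hat p) + 4\varepsilon^2 \\
&\le \inf_{p \in \Lambda} \left[ \inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \gamma(h) + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) + \mathrm{pen}(p) \right] + 4\varepsilon^2 \\
&\le \inf_{p \in \Lambda} \left[ \inf_{m \in \mathbb{N}^*} \left( \inf_{h \in S_{m,p}} \gamma(h) + \lambda_p m \varepsilon \right) + \mathrm{pen}(p) \right] + 4\varepsilon^2 \\
&\le \inf_{(m,p) \in \mathcal{M}} \left[ \inf_{h \in S_{m,p}} \gamma(h) + \lambda_p m \varepsilon + \mathrm{pen}(p) \right] + 4\varepsilon^2,
\end{aligned}$$

that is to say

$$\gamma(\hat f_{\hat p}) + \mathrm{pen}(\hat m, \hat p) \le \inf_{(m,p) \in \mathcal{M}} \left[ \inf_{h \in S_{m,p}} \gamma(h) + \mathrm{pen}(m, p) + \rho_p \right]$$

with $\mathrm{pen}(m, p) := \lambda_p m \varepsilon + a\, \mathrm{pen}(p)$ and $\rho_p := (1 - a)\, \mathrm{pen}(p) + 4\varepsilon^2$. This means that $\hat f_{\hat p}$ is equivalent to a $\rho_p$-approximate penalized least squares estimator over the sequence of models $\{S_{m,p}, (m, p) \in \mathcal{M}\}$. By applying Theorem 7.1, this property will enable us to derive a performance bound satisfied by $\hat f_{\hat p}$ provided that $\mathrm{pen}(m, p)$ is large enough. So, it remains to choose weights $x_{m,p}$ so that condition (7.4) on the penalty function is satisfied with $\mathrm{pen}(m, p) = \lambda_p m \varepsilon + a\, \mathrm{pen}(p)$.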
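Let us choose the weights of the form $x_{m,p} = \gamma m + \beta \ln p$ where $\gamma > 0$ and $\beta > 0$ are numerical constants specified later. Then,

$$\Sigma_{\gamma,\beta} := \sum_{(m,p) \in \mathcal{M}} e^{-x_{m,p}} = \left( \sum_{m \in \mathbb{N}^*} e^{-\gamma m} \right) \left( \sum_{p \in \Lambda} e^{-\beta \ln p} \right) = \left( \sum_{m \in \mathbb{N}^*} e^{-\gamma m} \right) \left( \sum_{J \in \mathbb{N}} e^{-\beta J \ln 2} \right) = \frac{1}{(e^{\gamma} - 1)(1 - 2^{-\beta})} < +\infty.$$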
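The closed form of this double geometric series is easy to confirm by truncation. The sketch below assumes the dyadic index set $\Lambda = \{2^J, J \in \mathbb{N}\}$ (including $p = 1$); the values of `gamma` and `beta` used in the test are arbitrary illustrations, not the constants of the proof.

```python
import math


def closed_form(gamma, beta):
    """1 / ((e^gamma - 1)(1 - 2^(-beta))), the claimed value of the series."""
    return 1.0 / ((math.exp(gamma) - 1.0) * (1.0 - 2.0 ** (-beta)))


def partial_sum(gamma, beta, m_max=200, j_max=200):
    """Truncated sum of exp(-(gamma*m + beta*ln p)) over m >= 1 and p = 2^J."""
    total = 0.0
    for m in range(1, m_max + 1):
        for j in range(j_max + 1):
            p = 2 ** j  # dyadic dictionary size (assumed form of Lambda)
            total += math.exp(-gamma * m - beta * math.log(p))
    return total
```

With moderate values such as `gamma = 0.5` and `beta = 0.3`, the truncated sum agrees with `closed_form` to high precision because both tails decay geometrically.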
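Moreover, for all $(m, p) \in \mathcal{M}$, we can prove similarly as (8.10) that

$$\Delta_{m,p} := \mathbb{E}\left[ \sup_{h \in S_{m,p}} W(h) \right] \le \sqrt{2}\, m\varepsilon \left( \sqrt{\ln p} + \sqrt{\ln 2} \right).$$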



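Now, defining $K = 4\sqrt{2}/5 > 1$, $\gamma = (1 - \sqrt{\ln 2})/K > 0$ and $\beta = (5a)/(4K) > 0$, and using the inequality $2\sqrt{ab} \le \eta a + \eta^{-1} b$ with $\eta = 1/2$, we have

$$\begin{aligned}
2K\varepsilon \left( \Delta_{m,p} + \varepsilon x_{m,p} + \sqrt{\Delta_{m,p}\, \varepsilon x_{m,p}} \right)
&\le K\varepsilon \left( \frac{5}{2} \Delta_{m,p} + 4 x_{m,p} \varepsilon \right) \\
&\le 4\varepsilon^2 \left( m\sqrt{\ln p} + m\sqrt{\ln 2} + K\gamma m + K\beta \ln p \right) \\
&\le 4\varepsilon^2 \left( m \left( \sqrt{\ln p} + 1 \right) + K\beta \ln p \right) \\
&\le 4\varepsilon^2 \left( m \left( \sqrt{\ln p} + 1 \right) + \frac{5a}{4} \ln p \right) \\
&\le \lambda_p m \varepsilon + a\, \mathrm{pen}(p).
\end{aligned}$$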
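The same kind of numerical check as for a single dictionary applies to the doubly indexed models. The sketch below assumes, as the chain above suggests, that $\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1)$ and $\mathrm{pen}(p) = 5\varepsilon^2 \ln p$, and it replaces $\Delta_{m,p}$ by its upper bound; all function names are hypothetical.

```python
import math

K = 4.0 * math.sqrt(2.0) / 5.0
GAMMA = (1.0 - math.sqrt(math.log(2.0))) / K
A = 1.0 - 4.0 / (5.0 * math.sqrt(math.log(2.0)))
BETA = 5.0 * A / (4.0 * K)


def lhs_condition(m, p, eps):
    """2 K eps (Delta + eps x + sqrt(Delta eps x)), x = gamma m + beta ln p."""
    delta = math.sqrt(2.0) * m * eps * (
        math.sqrt(math.log(p)) + math.sqrt(math.log(2.0)))
    x = GAMMA * m + BETA * math.log(p)
    return 2.0 * K * eps * (delta + eps * x + math.sqrt(delta * eps * x))


def penalty(m, p, eps):
    """pen(m, p) = lambda_p m eps + a pen(p), with the assumed lambda_p, pen(p)."""
    lam = 4.0 * eps * (math.sqrt(math.log(p)) + 1.0)
    return lam * m * eps + A * 5.0 * eps ** 2 * math.log(p)
```

Looping over dyadic `p` and a range of `m` confirms `lhs_condition <= penalty`, and `0 < A < 1` holds as stated.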

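Thus, condition (7.4) is satisfied and we can apply Theorem 7.1 with $\mathrm{pen}(m, p) = \lambda_p m \varepsilon + a\, \mathrm{pen}(p)$ and $\rho_p = (1 - a)\, \mathrm{pen}(p) + 4\varepsilon^2$, which leads to the following risk bound:

$$\mathbb{E}\left[ \|f - \hat f_{\hat p}\|^2 + \lambda_{\hat p} \hat m \varepsilon + a\, \mathrm{pen}(\hat p) \right] \le C \left[ \inf_{(m,p) \in \mathcal{M}} \left( \inf_{h \in S_{m,p}} \|f - h\|^2 + \lambda_p m \varepsilon + \mathrm{pen}(p) \right) + (5 + \Sigma_{\gamma,\beta})\, \varepsilon^2 \right] \le C \left[ \inf_{(m,p) \in \mathcal{M}} \left( \inf_{h \in S_{m,p}} \|f - h\|^2 + \lambda_p m \varepsilon + \mathrm{pen}(p) \right) + \varepsilon^2 \right], \tag{8.16}$$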



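where $C > 0$ denotes some numerical constant. The infimum of this risk bound can easily be extended to $\inf_{p \in \Lambda} \inf_{h \in \mathcal{L}_1(\mathcal{D}_p)}$. Indeed, let $p_0 \in \Lambda$ and $R > 0$, and consider $m_R = \lceil R/\varepsilon \rceil \in \mathbb{N}^*$. Then for all $g \in \mathcal{L}_1(\mathcal{D}_{p_0})$ such that $\|g\|_{\mathcal{L}_1(\mathcal{D}_{p_0})} \le R$, we have $g \in S_{m_R, p_0}$, and thus

$$\begin{aligned}
\inf_{(m,p) \in \mathcal{M}} \left( \inf_{h \in S_{m,p}} \|f - h\|^2 + \lambda_p m \varepsilon + \mathrm{pen}(p) \right)
&\le \|f - g\|^2 + \lambda_{p_0} m_R \varepsilon + \mathrm{pen}(p_0) \\
&\le \|f - g\|^2 + \lambda_{p_0} (R + \varepsilon) + \mathrm{pen}(p_0) \\
&\le \|f - g\|^2 + \lambda_{p_0} R + \left( \frac{4}{5\sqrt{\ln 2}} + 1 \right) \mathrm{pen}(p_0) + 4\varepsilon^2.
\end{aligned} \tag{8.17}$$

So, we deduce from (8.16) and (8.17) that there exists $C > 0$ such that

$$\mathbb{E}\left[ \|f - \hat f_{\hat p}\|^2 + \lambda_{\hat p} \hat m \varepsilon + a\, \mathrm{pen}(\hat p) \right] \le C \left[ \inf_{p \in \Lambda} \left( \inf_{R > 0} \left( \inf_{\|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R} \|f - g\|^2 + \lambda_p R \right) + \mathrm{pen}(p) \right) + \varepsilon^2 \right] \le C \left[ \inf_{p \in \Lambda} \left( \inf_{g \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - g\|^2 + \lambda_p \|g\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) + \mathrm{pen}(p) \right) + \varepsilon^2 \right]. \tag{8.18}$$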



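Finally, let us notice that from the fact that $a \in\, ]0, 1[$ and from (8.15), we have

$$\mathbb{E}\left[ \|f - \hat f_{\hat p}\|^2 + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + \mathrm{pen}(\hat p) \right] \le \frac{1}{a}\, \mathbb{E}\left[ \|f - \hat f_{\hat p}\|^2 + \lambda_{\hat p} \|\hat f_{\hat p}\|_{\mathcal{L}_1(\mathcal{D}_{\hat p})} + a\, \mathrm{pen}(\hat p) \right] \le \frac{1}{a}\, \mathbb{E}\left[ \|f - \hat f_{\hat p}\|^2 + \lambda_{\hat p} \hat m \varepsilon + a\, \mathrm{pen}(\hat p) \right]. \tag{8.19}$$

Combining (8.18) with (8.19) leads to the result. □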


8.1.4 Proof of Theorem 6.1 

The proof of Theorem 6.1 is again an application of Theorem 7.1 and it is thus very similar to the proof of Theorem 3.2. In particular, it is still based on the key idea that the Lasso estimator $\hat f$ is an approximate penalized least squares estimator over the collection of $\ell_1$-balls for $m \in \mathbb{N}^*$, $S_m = \{h \in \mathcal{L}_1(\mathcal{D}),\ \|h\|_{\mathcal{L}_1(\mathcal{D})} \le m\sigma/\sqrt{n}\}$. The main difference is that the dictionary $\mathcal{D}_p$ considered for Theorem 3.2 was finite while the dictionary $\mathcal{D} = \{\phi_{a,b},\ a \in \mathbb{R}^d,\ b \in \mathbb{R}\}$ is infinite. Consequently, we cannot use the same tools to check the assumptions of Theorem 7.1, more precisely to provide an upper bound of $\mathbb{E}[\sup_{h \in S_m} W(h)]$. Here, we shall bound this quantity by using Dudley's criterion (see Theorem 3.18 in [15] for instance) and we shall thus first establish an upper bound of the $t$-packing number of $\mathcal{D}$ with respect to $\|.\|$.

Definition 8.1. [$t$-packing numbers] Let $t > 0$ and let $\mathcal{G}$ be a set of functions $\mathbb{R}^d \to \mathbb{R}$. We call $t$-packing number of $\mathcal{G}$ with respect to $\|.\|$, and denote by $N(t, \mathcal{G}, \|.\|)$, the maximal $m \in \mathbb{N}^*$ such that there exist functions $g_1, \dots, g_m \in \mathcal{G}$ with $\|g_i - g_j\| > t$ for all $1 \le i < j \le m$.

Lemma 8.2. Let $t > 0$. Then, the $t$-packing number of $\mathcal{D}$ with respect to $\|.\|$ is upper bounded by

$$N(t, \mathcal{D}, \|.\|) \le (n + 1)^{d+1}\, \frac{4 + t}{t}.$$

Proof. The inequality can easily be deduced from the intermediate result (9.10) in the proof of Lemma 9.3 in [iS]. We recall here this result. Let $\mathcal{G}$ be a set of functions $\mathbb{R}^d \to \mathbb{R}$. If $\mathcal{G}$ is a linear vector space of dimension $D$, then, for every $R > 0$ and $t > 0$, the $t$-packing number of $\{g \in \mathcal{G},\ \|g\| \le R\}$ with respect to $\|.\|$ is upper bounded by

$$N\left( t, \{g \in \mathcal{G},\ \|g\| \le R\}, \|.\| \right) \le \left( \frac{4R + t}{t} \right)^D.$$

We can apply this result to the linear span of $\phi_{a,b}$, that we denote by $\mathcal{F}_{a,b}$, for all $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$. From (6.1), we have $\sup_x |\phi_{a,b}(x)| \le 1$, so $\|\phi_{a,b}\|^2 = \sum_{i=1}^n \phi_{a,b}^2(x_i)/n \le 1$ and we get that

$$\mathcal{D} = \bigcup_{a,b} \{\phi_{a,b}\} \subset \bigcup_{a,b} \left\{ g \in \mathcal{F}_{a,b},\ \|g\| \le 1 \right\},$$

with

$$N\left( t, \{g \in \mathcal{F}_{a,b},\ \|g\| \le 1\}, \|.\| \right) \le \frac{4 + t}{t},$$

for all $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$. To end the proof, just notice that there are at most $(n + 1)^{d+1}$ hyperplanes in $\mathbb{R}^d$ separating the points $(x_1, \dots, x_n)$ in different ways (see Chapter 9 in [s] for instance), with the result that there are at most $(n + 1)^{d+1}$ ways of selecting $\phi_{a,b}$ in $\mathcal{D}$ that will be different on the sample $(x_1, \dots, x_n)$. Therefore, we get that for all $t > 0$,

$$N(t, \mathcal{D}, \|.\|) \le (n + 1)^{d+1}\, \frac{4 + t}{t}.$$

□

Remark 8.3. Let us point out the fact that we are able to get such an upper bound of $N(t, \mathcal{D}, \|.\|)$ thanks to the particular structure of the dictionary $\mathcal{D}$. Indeed, for all $a \in \mathbb{R}^d$, $b \in \mathbb{R}$ and $x \in \mathbb{R}^d$, $\phi_{a,b}(x) \in \{0, 1\}$, but there are at most $(n + 1)^{d+1}$ hyperplanes in $\mathbb{R}^d$ separating the observed points $(x_1, \dots, x_n)$ in different ways, so there are at most $(n + 1)^{d+1}$ ways of selecting $\phi_{a,b} \in \mathcal{D}$ which will give different functions on the sample $(x_1, \dots, x_n)$. In particular, this property enables us to bound the packing numbers of $\mathcal{D}$ without truncation of the dictionary. This would not be possible for an arbitrary infinite (countable or uncountable) dictionary, and truncation of the dictionary into finite subdictionaries was necessary to achieve our theoretical results in Section 4 when considering an arbitrary infinite countable dictionary.

The following technical lemma will also be used in the proof of Theorem 6.1. 

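Lemma 8.4.

$$\int_0^1 \sqrt{\ln(1/t)}\, dt \le \sqrt{\pi}.$$

Proof. By integration by parts and by defining $u = \sqrt{2 \ln(1/t)}$, we have

$$\int_0^1 \sqrt{\ln(1/t)}\, dt = \left[ t \sqrt{\ln(1/t)} \right]_0^1 + \int_0^1 \frac{dt}{2\sqrt{\ln(1/t)}} = \frac{1}{\sqrt{2}} \int_0^{+\infty} e^{-u^2/2}\, du = \sqrt{\pi} \int_0^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.$$

But, if $Z$ is a standard Gaussian variable, we have

$$\int_0^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du = \mathbb{P}(Z > 0) \le 1,$$

hence the result. □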
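The integral handled in Lemma 8.4, $\int_0^1 \sqrt{\ln(1/t)}\,dt$, has the exact value $\Gamma(3/2) = \sqrt{\pi}/2$, so any bound of order $\sqrt{\pi}$ is comfortable. The sketch below approximates it with a plain midpoint rule (a hypothetical helper, chosen because midpoints avoid the singular endpoint $t = 0$).

```python
import math


def integral_sqrt_log(n=100000):
    """Midpoint-rule approximation of the integral of sqrt(ln(1/t)) on (0, 1)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h  # midpoint of the i-th cell, never 0 or 1
        total += math.sqrt(math.log(1.0 / t))
    return total * h
```

The returned value can be compared with `math.sqrt(math.pi) / 2`; adding `math.sqrt(math.log(5))` to `math.sqrt(math.pi)` also stays below 4, which is the room the subsequent constant $C$ needs.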



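Proof of Theorem 6.1. Let us define $\varepsilon = \sigma/\sqrt{n}$. Consider the collection of $\ell_1$-balls for $m \in \mathbb{N}^*$,

$$S_m = \left\{ h \in \mathcal{L}_1(\mathcal{D}),\ \|h\|_{\mathcal{L}_1(\mathcal{D})} \le m\varepsilon \right\}.$$

We have noticed in Section 7 that the Lasso estimator $\hat f$ is a $\rho$-approximate penalized least squares estimator over the sequence $\{S_m, m \ge 1\}$ for $\mathrm{pen}(m) = \lambda m \varepsilon$ and $\rho = \lambda \varepsilon$. So, it only remains to determine a lower bound on $\lambda$ that guarantees that $\mathrm{pen}(m)$ satisfies condition (7.4) and to apply the conclusion of Theorem 7.1.

Let $h \in S_m$. From (6.3), for all $\delta > 0$, there exist coefficients $\theta_{a,b}$ such that $h = \sum_{a,b} \theta_{a,b}\, \phi_{a,b}$ and $\sum_{a,b} |\theta_{a,b}| \le m\varepsilon + \delta$. By using the linearity of $W$, we get that

$$W(h) = \sum_{a,b} \theta_{a,b} W(\phi_{a,b}) \le \sup_{a,b} |W(\phi_{a,b})| \sum_{a,b} |\theta_{a,b}| \le (m\varepsilon + \delta) \sup_{a,b} |W(\phi_{a,b})|.$$

Then, by Dudley's criterion (see Theorem 3.18 in [15] for instance), we have

$$\mathbb{E}\left[ \sup_{h \in S_m} W(h) \right] \le (m\varepsilon + \delta)\, \mathbb{E}\left[ \sup_{a,b} |W(\phi_{a,b})| \right] \le 12 (m\varepsilon + \delta) \int_0^{\bar\sigma} \sqrt{\ln N(t, \mathcal{D}, \|.\|)}\, dt,$$

where $\bar\sigma^2 = \sup_{a,b} \mathbb{E}[W^2(\phi_{a,b})] = \sup_{a,b} \|\phi_{a,b}\|^2 = \sup_{a,b} \left( \sum_{i=1}^n \phi_{a,b}^2(x_i)/n \right) \le 1$ from (6.1). So,

$$\Delta_m \le 12 (m\varepsilon + \delta) \int_0^1 \sqrt{\ln N(t, \mathcal{D}, \|.\|)}\, dt.$$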

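Moreover, by using Lemma 8.2 and Lemma 8.4, we get that

$$\begin{aligned}
\int_0^1 \sqrt{\ln N(t, \mathcal{D}, \|.\|)}\, dt
&\le \int_0^1 \sqrt{\ln\left( (n + 1)^{d+1}\, \frac{4 + t}{t} \right)}\, dt \\
&= \int_0^1 \sqrt{\ln\left( (n + 1)^{d+1} \right) + \ln(4 + t) + \ln\left( \frac{1}{t} \right)}\, dt \\
&\le \sqrt{\ln\left( (n + 1)^{d+1} \right)} + \int_0^1 \sqrt{\ln 5}\, dt + \int_0^1 \sqrt{\ln\left( \frac{1}{t} \right)}\, dt \\
&\le \sqrt{\ln\left( (n + 1)^{d+1} \right)} + \sqrt{\ln 5} + \sqrt{\pi}.
\end{aligned}$$

Thus,

$$\Delta_m \le 12 (m\varepsilon + \delta) \left[ \sqrt{\ln\left( (n + 1)^{d+1} \right)} + C \right],$$

where $C = \sqrt{\ln 5} + \sqrt{\pi} \in\, ]0, 4[$.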



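Now, choose the weights of the form $x_m = \gamma m$ where $\gamma$ is a positive numerical constant specified below. Then $\sum_{m \ge 1} e^{-x_m} = 1/(e^{\gamma} - 1) := \Sigma_\gamma < +\infty$.

Defining $K = 14/13 > 1$, $\gamma = 13(4 - C)/4 > 0$, and using the inequality $2\sqrt{ab} \le \eta a + \eta^{-1} b$ with $\eta = 1/6$, we get that

$$\begin{aligned}
2K\varepsilon \left( \Delta_m + \varepsilon x_m + \sqrt{\Delta_m \varepsilon x_m} \right)
&\le K\varepsilon \left( \frac{13}{6} \Delta_m + 8 x_m \varepsilon \right) \\
&\le K (m\varepsilon + \delta)\, \varepsilon \left[ 26 \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + C \right) + 8\gamma \right] \\
&\le 28 (m\varepsilon + \delta)\, \varepsilon \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + C + 4 - C \right) \\
&\le 28 (m\varepsilon + \delta)\, \varepsilon \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + 4 \right).
\end{aligned}$$

Since this inequality is true for all $\delta > 0$, we get when $\delta$ tends to 0 that

$$2K\varepsilon \left( \Delta_m + \varepsilon x_m + \sqrt{\Delta_m \varepsilon x_m} \right) \le 28 m \varepsilon^2 \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + 4 \right) \le \lambda m \varepsilon$$

as soon as

$$\lambda \ge 28\varepsilon \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + 4 \right). \tag{8.20}$$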
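The arithmetic that turns $K = 14/13$, $\eta = 1/6$ and $\gamma = 13(4 - C)/4$ into the factor 28 in (8.20) can be replayed numerically. The sketch below is a hedged check: it assumes $C = \sqrt{\ln 5} + \sqrt{\pi}$, takes the limit $\delta \to 0$, and substitutes the Dudley-type upper bound for $\Delta_m$; the helper names are illustrative.

```python
import math

K = 14.0 / 13.0
C = math.sqrt(math.log(5.0)) + math.sqrt(math.pi)  # assumed value of C
GAMMA = 13.0 * (4.0 - C) / 4.0


def entropy_term(n, d):
    """sqrt(ln((n+1)^(d+1)))."""
    return math.sqrt((d + 1) * math.log(n + 1))


def lhs_condition(m, n, d, eps):
    """2 K eps (Delta_m + eps x_m + sqrt(Delta_m eps x_m)), Delta_m at its bound."""
    delta_m = 12.0 * m * eps * (entropy_term(n, d) + C)
    x_m = GAMMA * m
    return 2.0 * K * eps * (delta_m + eps * x_m
                            + math.sqrt(delta_m * eps * x_m))


def penalty_floor(m, n, d, eps):
    """28 m eps^2 (sqrt(ln((n+1)^(d+1))) + 4), the smallest admissible pen(m)."""
    return 28.0 * m * eps ** 2 * (entropy_term(n, d) + 4.0)
```

Checking `lhs_condition <= penalty_floor` over a few sample sizes `n`, dimensions `d` and radii `m` reproduces the displayed chain.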



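For such values of $\lambda$, condition (7.4) on the penalty function is satisfied and we may apply Theorem 7.1 with $\mathrm{pen}(m) = \lambda m \varepsilon$ and $\rho = \lambda \varepsilon$ for all $m \ge 1$. Taking into account the definition of $\hat m$ at (7.1) and noticing that $\varepsilon^2 \le \lambda \varepsilon / 112$ for $\lambda$ satisfying (8.20), the risk bound (7.6) leads to

$$\mathbb{E}\left[ \|f - \hat f\|^2 + \lambda \|\hat f\|_{\mathcal{L}_1(\mathcal{D})} \right] \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D})} \le m\varepsilon} \|f - h\|^2 + \lambda m \varepsilon \right) + \lambda \varepsilon + (1 + \Sigma_\gamma)\, \varepsilon^2 \right] \le C \left[ \inf_{m \ge 1} \left( \inf_{\|h\|_{\mathcal{L}_1(\mathcal{D})} \le m\varepsilon} \|f - h\|^2 + \lambda m \varepsilon \right) + \lambda \varepsilon \right],$$

where $C > 0$ is some absolute constant. We end the proof as the one of Theorem 3.2. □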

8.2 Rates of convergence 

8.2.1 Proofs of the upper bounds 

We first prove a crucial equivalence between the rates of convergence of the deterministic Lassos and $K_{\mathcal{D}}$-functionals, which shall enable us to provide an upper bound of the rates of convergence of the deterministic Lassos when the target function belongs to some interpolation space $\mathcal{B}_{q,r}$. Then, we shall pass on these rates of convergence to the Lasso and the selected Lasso estimators. Looking at the proofs of Proposition 5.6 and Proposition 5.7, we can see that the rate of convergence of a Lasso estimator is nothing else than the rate of convergence of the corresponding deterministic Lasso, whereas we can choose the best penalized rate of convergence of all the deterministic Lassos to get the rate of convergence of the selected Lasso estimator, which explains why this estimator can achieve a much better rate of convergence than any Lasso estimator.



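Proof of Lemma 5.1. Let us first prove the right-hand side inequality of (5.6). For all $h \in \mathcal{L}_1(\mathcal{D})$, $\lambda > 0$ and $\delta > 0$, we have

$$\|f - h\|^2 + \delta^2 \|h\|^2_{\mathcal{L}_1(\mathcal{D})} + \frac{\lambda^2}{4\delta^2} \le \left( \|f - h\| + \delta \|h\|_{\mathcal{L}_1(\mathcal{D})} \right)^2 + \frac{\lambda^2}{4\delta^2}.$$

Taking the infimum on all $\delta > 0$ on both sides, and noticing that the infimum on the left-hand side is achieved for $\delta^2 = \lambda / (2\|h\|_{\mathcal{L}_1(\mathcal{D})})$, we get that

$$\|f - h\|^2 + \lambda \|h\|_{\mathcal{L}_1(\mathcal{D})} \le \inf_{\delta > 0} \left[ \left( \|f - h\| + \delta \|h\|_{\mathcal{L}_1(\mathcal{D})} \right)^2 + \frac{\lambda^2}{4\delta^2} \right].$$

Then, taking the infimum on all $h \in \mathcal{L}_1(\mathcal{D})$, we get that

$$\inf_{h \in \mathcal{L}_1(\mathcal{D})} \left( \|f - h\|^2 + \lambda \|h\|_{\mathcal{L}_1(\mathcal{D})} \right) \le \inf_{\delta > 0} \left[ \inf_{h \in \mathcal{L}_1(\mathcal{D})} \left( \|f - h\| + \delta \|h\|_{\mathcal{L}_1(\mathcal{D})} \right)^2 + \frac{\lambda^2}{4\delta^2} \right] = \inf_{\delta > 0} \left[ \left( \inf_{h \in \mathcal{L}_1(\mathcal{D})} \left( \|f - h\| + \delta \|h\|_{\mathcal{L}_1(\mathcal{D})} \right) \right)^2 + \frac{\lambda^2}{4\delta^2} \right],$$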



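which proves the right-hand side inequality of (5.6). Let us now prove similarly the left-hand side inequality of (5.6). By definition of $L_{\mathcal{D}}(f, \lambda)$, for all $\eta > 0$, there exists $h_\eta$ such that $L_{\mathcal{D}}(f, \lambda) \le \|f - h_\eta\|^2 + \lambda \|h_\eta\|_{\mathcal{L}_1(\mathcal{D})} \le L_{\mathcal{D}}(f, \lambda) + \eta$. For all $\delta > 0$, we have

$$\begin{aligned}
K^2_{\mathcal{D}}(f, \delta) + \frac{\lambda^2}{2\delta^2}
&= \inf_{h \in \mathcal{L}_1(\mathcal{D})} \left( \|f - h\| + \delta \|h\|_{\mathcal{L}_1(\mathcal{D})} \right)^2 + \frac{\lambda^2}{2\delta^2} \\
&\le \left( \|f - h_\eta\| + \delta \|h_\eta\|_{\mathcal{L}_1(\mathcal{D})} \right)^2 + \frac{\lambda^2}{2\delta^2} \\
&\le 2 \left( \|f - h_\eta\|^2 + \delta^2 \|h_\eta\|^2_{\mathcal{L}_1(\mathcal{D})} \right) + \frac{\lambda^2}{2\delta^2}.
\end{aligned}$$

Taking the infimum on all $\delta > 0$ on both sides, and noticing that the infimum on the right-hand side is achieved for $\delta^2 = \lambda / (2\|h_\eta\|_{\mathcal{L}_1(\mathcal{D})})$, we get that

$$\inf_{\delta > 0} \left( K^2_{\mathcal{D}}(f, \delta) + \frac{\lambda^2}{2\delta^2} \right) \le 2 \left( \|f - h_\eta\|^2 + \lambda \|h_\eta\|_{\mathcal{L}_1(\mathcal{D})} \right) \le 2 \left( L_{\mathcal{D}}(f, \lambda) + \eta \right).$$

We get the expected inequality when $\eta$ tends to zero. □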
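A one-dimensional toy case makes the two-sided comparison between the deterministic Lasso and the $K$-functional concrete. The sketch below assumes the simplest possible setting, which is not the paper's: the Hilbert space is $\mathbb{R}$, the dictionary is the single unit vector, so $\|h\|_{\mathcal{L}_1(\mathcal{D})} = |h|$, and both functionals have closed forms. The infimum over $\delta$ is approximated on a logarithmic grid.

```python
import math


def lasso_value(f, lam):
    """min_h (f - h)^2 + lam |h|, solved by soft thresholding at lam / 2."""
    a = abs(f)
    if a >= lam / 2.0:
        return lam * a - lam * lam / 4.0
    return a * a


def k_functional(f, delta):
    """min_h |f - h| + delta |h| = min(1, delta) |f| in this toy setting."""
    return min(1.0, delta) * abs(f)


def inf_over_delta(f, lam, c, n=4001):
    """Grid approximation of inf over delta of K(f, delta)^2 + lam^2 / (c delta^2)."""
    best = float("inf")
    for i in range(n):
        delta = 10.0 ** (-4.0 + 8.0 * i / (n - 1))  # delta from 1e-4 to 1e4
        value = k_functional(f, delta) ** 2 + lam * lam / (c * delta * delta)
        best = min(best, value)
    return best
```

In this toy case one can compare `0.5 * inf_over_delta(f, lam, 2.0)`, `lasso_value(f, lam)` and `inf_over_delta(f, lam, 4.0)`; the middle quantity sits between the other two, in line with the two inequalities just proved.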
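Proof of Lemma 5.4. Let $p \in \mathbb{N}^*$ and $\delta > 0$. Applying (5.6) with $D = \mathcal{D}_p$ and $\lambda = \lambda_p$, we have

$$\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) \le \inf_{\delta > 0} \left( K^2_{\mathcal{D}_p}(f, \delta) + \frac{\lambda_p^2}{4\delta^2} \right). \tag{8.21}$$

So, it just remains to bound $K_{\mathcal{D}_p}(f, \delta)$ when $f \in \mathcal{B}_{q,r}(R)$. Let $\alpha := 1/q - 1/2$. By definition of $f \in \mathcal{B}_{q,r}(R)$, for all $\delta > 0$, there exists $g \in \mathcal{L}_{1,r}$ such that $\|f - g\| + \delta \|g\|_{\mathcal{L}_{1,r}} \le R\delta^{2\alpha}$. So, we have

$$\|f - g\| \le R\delta^{2\alpha} \tag{8.22}$$

and

$$\|g\|_{\mathcal{L}_{1,r}} \le R\delta^{2\alpha - 1}. \tag{8.23}$$

Then, by definition of $g \in \mathcal{L}_{1,r}$, there exists $g_p \in \mathcal{L}_1(\mathcal{D}_p)$ such that

$$\|g_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le \|g\|_{\mathcal{L}_{1,r}} \tag{8.24}$$

and

$$\|g - g_p\| \le \|g\|_{\mathcal{L}_{1,r}}\, p^{-r}. \tag{8.25}$$

Then, we get from (8.22), (8.25) and (8.23) that

$$\|f - g_p\| \le \|f - g\| + \|g - g_p\| \le R \left( \delta^{2\alpha} + \delta^{2\alpha - 1} p^{-r} \right), \tag{8.26}$$

and we deduce from (5.5), (8.26), (8.24) and (8.23) that

$$K_{\mathcal{D}_p}(f, \delta) \le \|f - g_p\| + \delta \|g_p\|_{\mathcal{L}_1(\mathcal{D}_p)} \le R \left( 2\delta^{2\alpha} + \delta^{2\alpha - 1} p^{-r} \right).$$

So, we get from (8.21) that

$$\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) \le \inf_{\delta > 0} \left( 2R^2 \left( 4\delta^{4\alpha} + \delta^{4\alpha - 2} p^{-2r} \right) + \frac{\lambda_p^2}{4\delta^2} \right). \tag{8.27}$$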

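Let us now consider $\delta_0$ such that $8R^2 \delta_0^{4\alpha} = \lambda_p^2 (4\delta_0^2)^{-1}$, and $\delta_1$ such that $2R^2 \delta_1^{4\alpha - 2} p^{-2r} = \lambda_p^2 (4\delta_1^2)^{-1}$, that is to say

$$\delta_0 = \left( \lambda_p \left( 4\sqrt{2} R \right)^{-1} \right)^{\frac{1}{2\alpha + 1}} \quad \text{and} \quad \delta_1 = \left( \lambda_p p^r \left( 2\sqrt{2} R \right)^{-1} \right)^{\frac{1}{2\alpha}}.$$

We can notice that there exists $C_q > 0$ depending only on $q$ such that $\delta_0^{4\alpha - 2} p^{-2r} \le C_q \delta_0^{4\alpha}$ for all $p$ checking $\lambda_p p^{r(2\alpha + 1)} \ge R$, while $\delta_1^{4\alpha} \le C_q \delta_1^{4\alpha - 2} p^{-2r}$ for all $p$ checking $\lambda_p p^{r(2\alpha + 1)} < R$. Therefore, we deduce from (8.27) that there exists $C_q > 0$ depending only on $q$ such that for all $p$ checking $\lambda_p p^{r(2\alpha + 1)} \ge R$,

$$\begin{aligned}
\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right)
&\le 2R^2 \left( 4\delta_0^{4\alpha} + \delta_0^{4\alpha - 2} p^{-2r} \right) + \frac{\lambda_p^2}{4\delta_0^2} \\
&\le C_q R^2 \delta_0^{4\alpha} \\
&\le C_q R^{\frac{2}{2\alpha + 1}} \lambda_p^{\frac{4\alpha}{2\alpha + 1}} \\
&= C_q R^q \lambda_p^{2 - q},
\end{aligned} \tag{8.28}$$

while for all $p$ checking $\lambda_p p^{r(2\alpha + 1)} < R$,

$$\begin{aligned}
\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right)
&\le 2R^2 \left( 4\delta_1^{4\alpha} + \delta_1^{4\alpha - 2} p^{-2r} \right) + \frac{\lambda_p^2}{4\delta_1^2} \\
&\le C_q R^2 \delta_1^{4\alpha - 2} p^{-2r} \\
&\le C_q \left( R p^{-r} \right)^{\frac{1}{\alpha}} \lambda_p^{2 - \frac{1}{\alpha}} \\
&= C_q \left( R p^{-r} \right)^{\frac{2q}{2 - q}} \lambda_p^{\frac{4(1 - q)}{2 - q}}.
\end{aligned} \tag{8.29}$$

Inequalities (8.28) and (8.29) can be summarized into the following result:

$$\inf_{h \in \mathcal{L}_1(\mathcal{D}_p)} \left( \|f - h\|^2 + \lambda_p \|h\|_{\mathcal{L}_1(\mathcal{D}_p)} \right) \le C_q \max\left( R^q \lambda_p^{2 - q},\ \left( R p^{-r} \right)^{\frac{2q}{2 - q}} \lambda_p^{\frac{4(1 - q)}{2 - q}} \right).$$

□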
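The exponents in (8.28) and (8.29) come from rewriting powers of $\alpha = 1/q - 1/2$ in terms of $q$. The sketch below verifies these four identities in exact rational arithmetic; the function name and the returned dictionary keys are illustrative only.

```python
from fractions import Fraction


def exponent_identities(q):
    """Return pairs (expression in alpha, expression in q) that should coincide."""
    q = Fraction(q)
    alpha = 1 / q - Fraction(1, 2)
    return {
        "R exponent in (8.28)": (2 / (2 * alpha + 1), q),
        "lambda exponent in (8.28)": (4 * alpha / (2 * alpha + 1), 2 - q),
        "R p^-r exponent in (8.29)": (1 / alpha, 2 * q / (2 - q)),
        "lambda exponent in (8.29)": (2 - 1 / alpha, 4 * (1 - q) / (2 - q)),
    }
```

For example, `exponent_identities(Fraction(3, 2))` gives $\alpha = 1/6$ and the pairs $(3/2, 3/2)$, $(1/2, 1/2)$, $(6, 6)$ and $(-4, -4)$.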



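Proof of Proposition 5.6. From (5.1) and (5.9), we know that there exists some constant $C_q > 0$ depending only on $q$ such that, for all $p \in \mathbb{N}^*$, the quadratic risk of $\hat f_p$ is bounded by

$$\mathbb{E}\left[ \|f - \hat f_p\|^2 \right] \le C_q \left[ \max\left( R^q \lambda_p^{2 - q},\ \left( R p^{-r} \right)^{\frac{2q}{2 - q}} \lambda_p^{\frac{4(1 - q)}{2 - q}} \right) + \lambda_p \varepsilon \right] \le C_q \max\left( R^q \lambda_p^{2 - q},\ \left( R p^{-r} \right)^{\frac{2q}{2 - q}} \lambda_p^{\frac{4(1 - q)}{2 - q}},\ \lambda_p \varepsilon \right).$$

By remembering that $\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1)$ and by comparing the three terms inside the maximum according to the value of $p$, we get (5.14), (5.15) and (5.16). □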

Proof of Proposition 5.7. From (5.2) and (5.9), we know that there exists 
some constant Cq > Q depending only on q such that the quadratic risk of fp is 
bounded by 



E 



<a 



U-fpf 



inf f max ( i^'A^-? , [Rp-") ^-^ X, ~ 



V 



r" Inp 



2-9 



4(l-g) 

< Cq inf I max | R"^ [e^Jlwp] ' , {Rp-"^)^ (eV^) '"' 
peA\{i} 



e^lnp 



(8.30) 

where we use the fact that for all $p \ge 2$, we have $\lambda_p = 4\varepsilon(\sqrt{\ln p} + 1) \le 4(1 + 1/\sqrt{\ln 2})\, \varepsilon \sqrt{\ln p}$ and $\varepsilon^2 \le \varepsilon^2 (\ln p)/\ln 2$. We now choose $p$ such that the two terms inside the maximum are approximately of the same order. More precisely, let us define



i^os,{Re-^] 



where [x] denotes the smallest integer greater than x, and Pq^,. := 2'^''''. Since 
we have assumed Re~^ > e, we have pq^r G A \ {1} and we deduce from (8.30) 
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that 



E 



ll/-/pll^ 



2-q - - . iii^ 



< C, max i?« (e VinP^) : (RPg.l) "'" [s./h^') '"' J + e^ Inp,,, . 

(8.31) 

Now, let us give an upper bound of each term of the right-hand side of (8.31) 
by assuming that Re^^ > max (c, {4:r)^^q). First, we have by definition oi pq^r 
that 

2r ' 

Moreover, for all x > 0, ln2 < 2xlnx and by assumption Re^^ > q/{Ar), so 

ln2< fluff) <f ln(i?£-i), 
- 2r \ArJ " 2r ^ ' ' 

and thus we get that 

Inpg.r < -ln(i?e"^). (8.32) 

Then, we deduce from (8.32) that the first term of (8.31) is upper bounded by 
i?« (e^i^^)' ' < R' (^)'"^ (ev/ln(i?e-i))'~' . (8.33) 

For the second term of (8.31), using (8.32), the fact that Pq^r > (^£~^)~; that 
^^ < 2 - g and that ln(i?e-i) > 1, we get that 

4(l-g) 

i(i-i) / r-. \ 2^ 



(i?p,^;) ^-' (e^hr^) '"' <R'ie^-'>U^\n{Re-^ 



<i?«(-j ""' (ev/ln(i?e-i)j . (8.34) 



For the third term of (8.31), we have 

9/2 



9 9/ IX 9 / In [(i?e"')'] \ / ^-, TT\2-9 

2(i?e-i^2 



e^ Inp,,. < ^ e^ In (i?£-i) = f ( .^/p^,,/, ^ ) i?' (eVWife^ 
Now, let us introduce 



lu X 

g : ]0,+oo[n- M, X i-^- . 

X 

It is easy to check that g(x?) < 1/x for all x > 0. Using this property and the 
fact that Re~^ > e, we get that 

^^^pTy^-.9((i?£ ))<^^pT<-, 

and thus 

e'lnpg,. < ^ ('^') i?^' (£Vln(i?e-i))'"'. (8.35) 
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Then, wc deduce from (8.31), (8.33), (8.34) and (8.35) that there exists Cq.r > 
depending only on q and r such that 



E 



/ - fpf] < Cg,r R" {e^\n{Re-^) 



2-9 



n 



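Proof of Proposition 6.3. Set $\varepsilon = \sigma/\sqrt{n}$. From Theorem 6.1, we have

$$\mathbb{E}\left[ \|f - \hat f\|^2 \right] \le C \left[ \inf_{h \in \mathcal{L}_1(\mathcal{D})} \left( \|f - h\|^2 + \lambda \|h\|_{\mathcal{L}_1(\mathcal{D})} \right) + \lambda \varepsilon \right], \tag{8.36}$$

where $C$ is an absolute positive constant. Then, if $f \in \mathcal{B}_q(R)$, we get from (6.4) and (5.6) with $D = \mathcal{D}$ that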

The infimum on the right-hand side is achieved for 5 — (A/(2_R)) '' " ' and 
the last inequality leads to 

inf_ (11/ - hf + \\\h\\c,^r,j) < 2R^^ ( ^^ "^' 



l-2a 2 4q 

= 2 1 + 2° i?2a + l X2C + 1 

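Thus, we deduce from (8.36) that there exists some $C_q > 0$ depending only on $q$ such that

$$\begin{aligned}
\mathbb{E}\left[ \|f - \hat f\|^2 \right]
&\le C_q \left[ R^q \lambda^{2 - q} + \lambda \varepsilon \right] \\
&\le C_q \left[ R^q \left( \varepsilon \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + 4 \right) \right)^{2 - q} + \varepsilon^2 \left( \sqrt{\ln\left( (n + 1)^{d+1} \right)} + 4 \right) \right] \\
&\le C_q \left[ R^q \left( \varepsilon \sqrt{\ln\left( (n + 1)^{d+1} \right)} \right)^{2 - q} + \varepsilon^2 \sqrt{\ln\left( (n + 1)^{d+1} \right)} \right] \qquad (8.37) \\
&\le C_q \max\left( R^q \left( \varepsilon \sqrt{\ln\left( (n + 1)^{d+1} \right)} \right)^{2 - q},\ \varepsilon^2 \sqrt{\ln\left( (n + 1)^{d+1} \right)} \right) \\
&\le C_q R^q \left( \varepsilon \sqrt{\ln\left( (n + 1)^{d+1} \right)} \right)^{2 - q}, \qquad (8.38)
\end{aligned}$$

where we get (8.37) by using the fact that $4 \le 5\sqrt{\ln 2} \le 5\sqrt{\ln\left( (n + 1)^{d+1} \right)}$ for $n \ge 1$ and $d \ge 1$, and (8.38) thanks to the assumption $R\varepsilon^{-1} \ge \left[ \ln\left( (n + 1)^{d+1} \right) \right]^{\frac{q - 1}{2q}}$. □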

8.2.2 Proofs of the lower bounds in the orthonormal case 

To prove that the rates of convergence (5.17) achieved by the selected Lasso estimator on the classes $\mathcal{B}_{q,r}$ are optimal, we propose to establish a lower bound of the minimax risk over $\mathcal{B}_{q,r}$ when the dictionary is an orthonormal basis of $\mathbb{H}$ and to check that it is of the same order as the rates (5.17). The first central point is to prove Remark 5.5, that is to say the inclusion in the orthonormal case of the space $w\mathcal{L}_q(R) \cap \mathcal{B}^r_{2,\infty}(R)$ in the space $\mathcal{B}_{q,r}(C_{q,r} R)$ for all $R > 0$ and some $C_{q,r} > 0$ depending only on $q$ and $r$. Taking this inclusion into account, we shall then focus on establishing a lower bound of the minimax risk over the smaller space $\mathcal{L}_q(R) \cap \mathcal{B}^r_{2,\infty}(R)$, which turns out to be an easy task, and which entails the same lower bound over the bigger space $\mathcal{B}_{q,r}$.

Proof of Remark 5.5. Let $R > 0$. The first inclusion comes from the simple inclusion $\mathcal{L}_q(R) \subset w\mathcal{L}_q(R)$. Let us prove the second inclusion here. Assume that $f \in w\mathcal{L}_q(R) \cap \mathcal{B}^r_{2,\infty}(R)$. For all $p \in \mathbb{N}^*$ and $\beta > 0$, define

U.p := argmin (!|/ - hf + P\\h\\c,(v,)) ■ (8.39) 

The proof will be divided in two main parts. First, we shall choose /3 such that 
/p,/3 G £i,r- Secondly, we shall choose p such that ||/ — /p,^|| + (5||/p_^||£j^ < 
Cq^rRS^" for all i5 > 0, some Cq^r > and a = 1/q — 1/2, which shall prove 
that / G Bq^r{Cq,rR)- To establish our results, we shall need an upper bound of 
11/ ^ fp.i^W siiid |l/p,/3||£i(D ). These bounds are provided by Lemma 8.5 stated 
below. 

Let us first choose /3 such that fp^p G ^Ci.r- From Lemma 8.5, we have 

11/ - fp.f3\\ < RiP+iy + ^qR'^/^fS'-"/'. 
Now choose /3 such that R{p + 1)"'' = yC, i?«/2^i-«/2^ that is to say 

__ 2 

/?p = i?(v/^(p+l)'-)"^. (8.40) 

Then, we have 

\\f-fp,pj\ < 2Rip+l)-^ < 2R{2p)-^ = 2'-^Rp-\ (8.41) 

Let us now cheek that fp^fj €z Ci^r- Define 

Cp:=max\2'-^R, max Wfp'^pJcAv,,)] ■ (8-42) 

I p't^N*,p'<p ^ ^ p ' J 

Let p' E N*. By definition of /p'_^ ,, we have fp'^p , G £i(2?p'). Up' < p, then 
we deduce from (8.41) that 

||/p,/3, - .U',0j < \\fp,p, /ll + 11/ - .fp',0j < 2'-'-R {p-^ +?'-"•) < Cpp'-^, 

and we have |l/p'.^ , |l£i(X' ,) l£ Cp by definition of Cp. If p' > p, then Ci(T>p) C 
Ci{Vp,) and fp^p^ G Ci{Vp,) with \\fp,pjct{v^,) < II /p,/Jp II £i(i5p) < Cp and 
ll/p./3p - fp,P,\\ = < Cpp'-\ So, /p,^^ G £1,.. 
Now, it only remains to choose a convenient p G N* so as to prove that 

/e Bq^R)- 

Let us first give an upper bound of ||/p,^y||£i^ for all p G N*. By definition of 
||/p,^p||£i r S'Hd the above upper bounds, we have ||/p,;3p|l£i ^ < Cp. So, we just 
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have to bound Cp. Let p' G W ,p' < p. From Lemma 8.5, we know that there 
exists Cq > depending only on q such that \\fp\i3 ,||£iCd ,) < CqR'^fi 7'^- So, 
we get from (8.40) that 

2 



2(g-l) 
2-9 



^ . 2(,-l)r 

2-9 D ('0^\ 2-g 



and we deduce from (8.42) that 

Cp < max (22-'-i?,Cj^i?(2p)^^5^ j < Cq^rRp^^^^ 
where Cq^r > depends only on q and ?'. Thus, wc have 

\\U,pM,.<Cq^rRp^^^^. (8.43) 

Then, we deduce from (8.41) and (8.43) that for all p eW and 5 > 0, 

inf 11/ - h\\ + ^ll/ill^,^ < 11/ - /p,^J| + 5\\fpj,Jc,.. 

< 2^-''Rp-'' + SCq.rRp'-^"^ . (8.44) 

We now choose p > 2 such that p '' ^ Sp ^-i . More precisely, set p = 2"^ 
where J = [(2 — q){qr)'^^ \og2{6^^)~\ . With this value of p, we get that there 
exists C'gj. > depending only on q and r such that (8.44) is upper bounded by 
C^_^i?<5(2'-'3)/9 = C'q„R5^". This means that / £ ^,,^((7,',, i?), hence (5.13). D 

Lemma 8.5. Assume that the dictionary D is an orthonormal basis of the 
Hilhert space EI and that there exist l<q<2, r>{) and R > such that 
f <E wCq{R) n Bl^{R). For all peW and 13 > 0, define 

fp,p:^ argmin {\\f - h\\' + mWc^iv,)) ■ 

Then, there exists Cq > depending only on q such that for all p G N* and 
/3>0, 

||/p./^IU,(p,)<C7,i?«/3i-« 

and 

11/ - fpA < Rip + ^y + v^i?«/'/3'-'/^ 

The proof of Lemma 8.5 uses the two following technical lemmas. 
Lemma 8.6. For all a = (oi, . . . , Op) € MP and 7 > 0, 

p »7 






\>t} 



dt. 
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Proof. 






7 \ / p 

il{|a,|>t} dt l{|a,|>7} + / ^l{|a,|>i} dt 1 l{|a,|<7} 



P 

p 
p 
p 



rtdt)l{|a,|>,}+N 



|aj| 



t dt 1 



{|aj|<7} 



D 



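Lemma 8.7. For all $a = (a_1, \dots, a_p) \in \mathbb{R}^p$ and $\gamma > 0$,

$$\sum_{j=1}^p |a_j|\, 1_{\{|a_j| > \gamma\}} = \gamma \sum_{j=1}^p 1_{\{|a_j| > \gamma\}} + \sum_{j=1}^p \int_\gamma^{+\infty} 1_{\{|a_j| > t\}}\, dt.$$

Proof.

$$\sum_{j=1}^p \int_\gamma^{+\infty} 1_{\{|a_j| > t\}}\, dt = \sum_{j=1}^p \left( \int_\gamma^{|a_j|} dt \right) 1_{\{|a_j| > \gamma\}} = \sum_{j=1}^p \left( |a_j| - \gamma \right) 1_{\{|a_j| > \gamma\}}.$$

□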

Proof of Lemma 8.5. Let denote by {6'j}jeN' the coefRcients of the target 
function / in the basis V = {(j)j}j^fq», f = 6* .(j) = X^isn* ^j 't'j- ^^ introduce 
forallpeN*, 



e^ :- {e = (9,) 



J en* , 



L,. . . ,0p,O, . . . ,0, . . .)}. 



Let P > 0. Since fp,ji e £i(Pp), there exists a unique 0^^*^ G 9p such that 
/p./3 = O^'^ .(j>. Moreover, from (3.1) and using the orthonormahty of the basis 
functions ^j, we have 

QP'^ = argmin {\\9* .^ - e.<j)\\^ + p\\9\\i) = argmin (||r - 0f + (3\\e\\i) . 

(8.45) 
By calculating the subdifferential of the function 6* e Rp H> ||6I* - 6*112 + ^||6'||i, 
we get that the solution of the convex minimization problem (8.45) is 6^'^ = 
, 6lP'^, 0, . . . , 0, . . . ) where for ah j = 1, . . . ,p, 



'1 :■•■:' 



9* - p/2 if 9* > p/2, 
9* + /3/2 if 9* < -/3/2, 
else. 
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Then, we have 

2 



j 



EK*-' 



00 p „ p 

< E ^f +E^fi{i«ii</3/2}+fEi^;ii{i«;i>^/2} ■ (8-46) 

j=p+l j=l j=l 

'' .. ^ ^' .. ' '• .. ' 



(i) {ii) (Hi) 



while 



\\fp.Ac.iv,) = T.K' 



i=i 



E(l^ll-f)i{i«;i>/^/2} 



<E 1^1 |i{i«;i>/3/2} = ("*)■ (8-47) 

Now, since / is assumed to belong to B2 aoi-^)^ ^^ S'^t from (5.10) that (i) is 

bounded by 

00 

Y^ ef <i?2(p + i)-2-. (8.48) 

j=p+i 
Let us now bound (ii) and (in) thanks to the assumption / G wCq{R). By 
applying Lemma 8.6 and Lemma 8.7 with Oj = 6* for all j = l,...,p and 
7 = /3/2, and by using the fact that Yl^^^i l{|e-|>t} < Ljli l{ie*|>t} < R''t~'^ 
for alH > if / e wCq{R), we get that (ii) is bounded by 

E^l^^{|s;i</3/2} - ^E / ^^{|e;i>*} '^^ 

/•<3/2 

Jo 
= 1^ i?9/32-«, (8.49) 

while (iii) is bounded by 

P o P P ^ + 00 

El^ll^{|s;i>/5/2} = 2 E^{i«;i>/3/2} + E / i{ie*i>*} '^^ 

= ^ Rip^-'i. (8.50) 

fl- 1 
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Gathering together (8.47) and (8.50) on the one hand and (8.46), (8.48), (8.49) 
and (8.50) on the other hand, wc get that there exists Cq > depending only 
on q such that 

||/p,0||£,(P,)<C,i?«/3l-« 

and 

11/ - fpA' < i?'(p + 1)-"" + C,i?«/32-'?. 

Finahy, 



11/ - fp,p\\ < ^R^{p + i)-^- + Cqmp^~i < Rip + 1)-'- + yc^i?«/2/3i-«/2. 

n 

Proof of Proposition 5.10. Let us define 

M = ev/uln(i?e-i), p ^ 2-\ d == 2^^, 
with 



J = 

and 



^ log2 {RM- 



K= [q\og.^{RM-^)\. 

Let us first check that M is well-defined and that d < p under the assumptions 
of Proposition 5.10. Under the assumption r < 1/q — 1/2, we have u > 0, and 
since Re~^ > e^ > e, M is well-defined. Moreover, since r < 1/q— 1/2, we have 
(2 — q)/{2r) > q, so it only remains to check that RM^^ > e so as to prove that 
d < p. We shall in fact prove the following stronger result: 

Resuh (0): If R£-^ > max{e'^ , u'^) , then Re-^ / {\n{Re-^)) > u. 

This result indeed implies that, under the assumption Re^^ > max(e^,w^), 



RM-' ^ Re ' ^ =VrF^J , ^/„ \ > e X 1 > e. 

Let us prove Result {(}). Introduce the function 

g : 10, +oo[i-^ R, X i-^ . 

lux 

It is easy to check that g is non-decreasing on [e, -t-oo[ and that g(x^) > x for 
all a; > 0. Now, assume that Re^^ > max(e^, u^). Using the properties of g, we 
deduce that if m > e then Re~^ > u^ > e^ > e and 

g{Re^^) > g{u^) > u. 



ln(i?£-i) 

while if u < e then Re^^ > c^ > e and 
Re-^ 



ln{Re- 



- = 5(i?e-i) > .g(e2) > e > u. 
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hence Result {(}). 

Now, consider the following hypercube 8(p, d, M) defined by 



J2 &i <Pi, {Oi,...,ep)e [0, M]P, Q, = for J > p + 1, ^ l{e,^o} = d 



j=i 



j=i 



M^/?, 0„ (A, . . . ,/3p) e [0, If, ft- = for j > p+ 1, 5^1{ft.^o} = d 

The essence of the proof is just to ckeck that 0(p, d, Af) C Cq{R) D B2oo{R), 
which shall enable us to bound from below the minimax risk over Cq{R) n 
B2 ooiR) by the lower bound of the minimax risk over &{p, d, M) provided in [2]. 

Let h e e(p, d, A/). We write h = YlT=i ^j'^J == ^^ ^i^i ^J'^J- 
00 p p 

^ |0,f = M« j] ft' l{^^^o} < M' E Ifft^o} ^ ^'^'^ ^ ^'^' {RM-y < R'^. 

Thus, /i e £g(i?). 

Let Jo eN*. If Jo > p, then, 

00 00 

j=Jo i=p+i 

Now consider Jg < p. Then, 



j=Jo j=Jo 

< J^M^ E l{A#o} 

< {RM-^f~' M^RM-'y 

< i?2. 

Thus, /i e SJool-R)- Therefore, e{p,d,M) C /:,(i?) n iJj' „o(i?) and 



inf sup E 

/ /G£,(fl)nBJ,^(i?.) 



|/-/!|2 >inf sup E 11/ -/!P 
-' / /ee(p,d,A/) L 



.51) 



Now, from Theorem 5 in [2], we know that the minimax risk over Q {p,d,M) 
satisfies 



inf sup E 
/ f£e{p,d,M) 



I/-/II 



> Kdmin A/^,£^ 1 + In 



> K 



{RM 



-n? 



■min(A//^£2 fl + lnf^ 



> k'R'^M-'^ min M^, eM 1 + In 



5.52) 
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where k > and k' > are absolute constants. Moreover, we have 



(l + ln( 



> eM 1 + In 



{RAI-'] 



1 + hi 



> e^ln 



2{RM-^f 
[RM-^Y - 
{RM-'Y 
{Re-'Y {eM~^Y 



In 2 



M^ + e^ hi 



[eM 



-i\" 



M2--£2in[yln(i?£-i)] 



.53) 



But the assumption Re ^ > max(e^,M^) unphes that (8.53) is greater than 
M-^/2. Indeed, first notice tliat 



M^-^e^ln[.ln(i?.-)]>MV2 ^ j^^^ 



> u. 



5.54) 



and then apply Result (^) above. Thus, we deduce from (8.51), (8.52), (8.53) 
and (8.54) that there exists k" > such that 



inf sup E 

/ /e£,(i?)n6J,^(i?) 



11/ - ff] > k"R1M^^'' = Hi"u^-iR'i {e^/\n{Re-^) 



■I^q 



u 